Finding Footprints For Scrapebox
Okay – so you finally got Scrapebox (you don’t have it?? Get it NOW!). Now you need large lists of targets, either for your Scrapebox comment runs or for some other purpose or program. In a future post, I’m going to write about how I scraped new targets for Instant Social Anarchy – and the footprints I used.
But first, I thought it would be good to talk about HOW to find the footprints needed for programs like Scrapebox. What Scrapebox does is this: it takes your footprint and keywords, and combines them to create a search query (note: I’ll talk about this a little later, but for now – just know that you don’t always need keywords, however you will get much larger lists if you use a large word list or keyword list). It then inputs that search query into Google and/or Yahoo, Bing, and AOL – and records all the sites/URLs listed in the results – up to 1,000 URLs per query or keyword. Then you can remove the duplicate URLs or duplicate domains and go from there (export the list for use in another program, check the page rank, or use it with Scrapebox’s blog commenter).
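To make that combining step concrete, here’s a minimal Python sketch (my own illustration, not Scrapebox’s actual code) of pairing one footprint with a keyword list to form the queries, and de-duping the harvested URLs afterwards:

```python
# Illustrative only: the footprint and keywords are example values.

def build_queries(footprint, keywords):
    """Pair the footprint with every keyword; with no keywords,
    the footprint alone is the single query."""
    if not keywords:
        return [footprint]
    return [f"{footprint} {kw}" for kw in keywords]

def dedupe_urls(urls):
    """Remove duplicate URLs while keeping the original order."""
    seen = set()
    unique = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            unique.append(url)
    return unique

queries = build_queries('"Add comment" "power by plog"', ["sports", "science"])
print(queries)
# Each query then goes to Google/Yahoo/Bing/AOL, up to ~1,000 results apiece.
```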
So what IS a footprint? It’s something that a search engine can identify easily in your target sites – and it’s common amongst any site that uses the particular framework or CMS that you are targeting. It is important to find a footprint that is found in most, if not ALL the sites that use that particular framework or CMS – AND that it’s NOT found in unrelated sites or sites that are NOT using that framework or CMS. In other words – you want to find a phrase or some aspect of the site’s code that is only in that type of site…
Let’s take the most basic footprint you can use… the “Powered By” footprint. This is a common attribute of sites that use a particular framework or CMS. You’ll usually find it at the bottom of any site that uses that framework, and it usually links back to the home page of the framework or CMS – however, sometimes it’s just regular text. Doesn’t matter though… What you are looking for is the text that’s particular to your target.
For example – if you want to find “pLog” sites.. a footprint you can use is:
“Add comment” “power by plog”
If you enter that into Google – you’ll get a list of pLog sites because it’s now showing all the sites that have BOTH those phrases that I put into quotes.. You might be wondering why I used more than just “power by plog”. Well, it’s because not ALL pLog sites accept comments. It’s an option that the site owner can turn off. Since I’ll be spamming them – I don’t want a list with sites that I can’t utilize – so I make sure it also says “Add comment” somewhere in the site.
You can mix, match, and combine queries. You can make them as complex or as big as you want, really… Generally, the more complex the footprint is, the fewer sites you’ll get back. But you don’t want useless sites either – so make sure it’s general enough to get a large list, but not so general that you’ll get a list full of sites you can’t utilize. That will either slow down the program you are importing this list into, or make it not work as well…
Let’s look at another footprint:
“comment” “Powered By ExpressionEngine” -forum
This is to find ExpressionEngine sites. So I used the popular “Powered By …” phrase, plus the “comment” phrase, but ALSO I added: -forum. That will find all the ExpressionEngine sites that do NOT have the word “forum” in them. Remember, you can use various operators and expressions in your queries, and they will help you out tremendously. The only problem with certain operators is that they will not always work with all the search engines – many Google operators or expressions will not work with Bing or AOL. So sometimes I’ll have two sets of footprints: ones that work with Google, and ones that are more universal and will work with all (or most) of the search engines. I’ll scrape the Google ones first (making sure Scrapebox is set to only scrape Google), then I’ll do the other 3 search engines with the footprints that aren’t unique to Google’s parameters, and then I’ll combine those lists and remove the duplicates.
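The final merge-and-dedupe step of that two-pass strategy can be sketched like this (the URLs are placeholders; this is just an illustration of the idea, not a Scrapebox feature):

```python
# Combine the Google-only harvest with the universal-footprint harvest,
# dropping duplicate URLs while keeping the order they were found in.

def merge_harvests(google_urls, universal_urls):
    """Order-preserving union of both harvest passes."""
    seen = set()
    merged = []
    for url in google_urls + universal_urls:
        if url not in seen:
            seen.add(url)
            merged.append(url)
    return merged

pass_one = ["http://a.example/post", "http://b.example/entry"]
pass_two = ["http://b.example/entry", "http://c.example/page"]
print(merge_harvests(pass_one, pass_two))
# → ['http://a.example/post', 'http://b.example/entry', 'http://c.example/page']
```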
So how do you find these phrases? What I will do is take at least THREE sites that I KNOW are using the framework or CMS that I am targeting, and open them in my internet browser. I’ll then view the SOURCE CODE. This is important because you want to view the page as the search engines do – which is usually just plain text or source code. It’s important too, because some sites don’t make it very easy to find the common traits – because they’ll use images or hide things with CSS or the page is just too big or complex to find little bits of common text. It’s just much easier to look at the source code.
So what you need to find are unique phrases or bits of code that are found in ALL three (or however many) sites you have open.. And make sure it’s specific to that type of site – meaning it’s not found in other sites..
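You can even semi-automate that manual inspection. Here’s a rough sketch (my own helper, not part of any tool) that fetches the raw source of your sample sites and intersects their lines to surface text present in ALL of them – candidate footprints you’d then eyeball and test by hand:

```python
# Candidate-footprint finder: intersect the source lines of known sites.
import urllib.request

def fetch_source(url):
    """Download the raw page source, roughly as a crawler sees it."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

def common_lines(sources, min_len=15):
    """Return non-trivial lines that appear in EVERY page's source."""
    line_sets = [
        {line.strip() for line in src.splitlines() if len(line.strip()) >= min_len}
        for src in sources
    ]
    return sorted(set.intersection(*line_sets))

# Usage (URLs are placeholders for sites you KNOW use the target CMS):
# sources = [fetch_source(u) for u in known_site_urls]
# for line in common_lines(sources):
#     print(line)
```

Anything this prints still needs a sanity check – a line common to your three samples might also be common to half the web.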
The first thing you want to look for is the name of the framework or CMS you are targeting. If it’s not present anywhere, check all your sites – sometimes a site’s developers remove every instance of the framework or CMS name, but this isn’t the norm, so even if one of them does it, the majority won’t. If the name really is absent, you need to start looking for other common traits… something in the navigation or the footer that is common to all the sites that use that framework…
Something to remember: You are looking for unique phrases, but this does not mean you can’t use common words or phrases.. Sometimes it’s the ORDER of the phrase(s) or the fact that all the sites have ALL the parts/words or phrases whereas other sites might have part of it, but they won’t have all of it.. So you can then make sure in your query that it’s looking for sites that have ALL the words, and not just SOME of the words..
For example – if you want to find Blog2Evolution sites.. one of the footprints you can use is:
“Your email address will not be revealed on this site.” +”Your URL will be displayed.”
Those phrases/words by themselves aren’t very unique, however, with that query, you are telling the search engines that those entire phrases are BOTH required to show up on the page somewhere… That query will then give you a list of Blog2Evolution sites.
You always want to test your footprints in Google.com – just to make sure they are pulling the right sites. You don’t want to waste your time, proxies, and bandwidth scraping a big list of worthless sites (remember – a footprint that isn’t pulling the sites you want will STILL pull sometimes LARGE lists of sites, and if they aren’t the targets you are looking for, it’s a big waste of time and energy). So once I find a good footprint (or what I think is a good footprint), I’ll just copy and paste it into Google’s search field and look at the results. If the search results are showing the type of sites you are looking for, you’re good to go. Just make sure you look at a few pages, and also make sure there ARE multiple pages (if it only finds a few results, then your footprint is obviously TOO specific). Also – many good footprints won’t be 100%. There WILL be some sites that show up that you won’t be able to use, or that aren’t the right type of site. That’s okay… As long as they aren’t the majority, you should still be able to use the list (a good program will just skip the sites that aren’t valid, or it won’t spend too much time on them).
Once you have your footprint(s), you can then scrape the search engines to get a big list of targets. I generally will use public proxies (harvested and tested with Scrapebox) when scraping search engines (if I’m blog commenting – I use my own private proxies). The # of proxies depends on how many queries you will be doing. Sometimes you can get away with only using 30 or so, but generally I like to have at least 75.. The more the better.. 150 or more is optimal.
So, going back to what I mentioned earlier about Scrapebox needing keywords with the footprint… If you are entering one footprint at a time (in other words, you’re not using the multiple footprint mode) – you don’t need a keyword. However, you do need to remember a few key points:
1. Scrapebox will use ONE connection (meaning ONE proxy) per footprint/keyword combination. That means if you don’t have any keywords listed – it’s not going to scrape with multiple threads, so it will be slower. This also means that it’s just using one proxy – which isn’t good because if you are using public proxies, and the particular proxy that it chose for this thread wasn’t working at that time, or is banned, you will get bad results, if any.
2. The search engines typically will only show you 1,000 results per query. So if you are using a footprint with no keywords, that’s only one query and you will get at MOST, only 1000 results. That’s not a lot, considering many will be duplicates or numerous URLs from the same domain.
3. Since you will only get 1,000 results, you probably aren’t getting all the sites that have that footprint. Using large keyword lists or word lists will get you much better/longer results…
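The back-of-the-envelope math behind those three points looks like this (the 1,000-result cap is the search engines’ limit, not Scrapebox’s):

```python
# Upper bound on raw (pre-dedupe) URLs a harvest can return.
RESULTS_CAP = 1000  # typical per-query limit imposed by the engines

def max_results(num_footprints, num_keywords):
    """Footprints x keywords = queries; each query caps at ~1,000 URLs.
    With no keywords, each footprint is a single query on its own."""
    queries = num_footprints * max(num_keywords, 1)
    return queries * RESULTS_CAP

print(max_results(1, 0))     # footprint alone: at most 1,000 URLs
print(max_results(1, 1000))  # footprint + 1K word list: up to 1,000,000 raw URLs
```

The real haul is always smaller once duplicates are removed, but the gap between one query and a thousand is why the word list matters.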
That being said – I do it both ways (wow, that sounded wrong).. I do one run with just the footprint and no keywords (and I’ll either use NO proxy for this, or I’ll use one of my private proxies if my proxy gets banned), and then I’ll do a run with a large word list. The word list can either be targeted – meaning you want to find sites related to a certain niche (however your list will be MUCH smaller this way and will probably have 90% duplicates since targeted keyword lists by nature have similar words that will output similar results) – or you can use a very general word list (the more general, the better). However, you don’t want a normal keyword list like you are used to. You want general, UNRELATED words (unrelated from other words in your list). You basically want the search engines to find as many sites as possible – so if you use similar words, they will find similar sites – so your list might be big – but once you de-dupe it, it will be MUCH smaller (probably by 95% at least). The reason I also do a run with no keywords – is that if your target isn’t related to any of the words you have in your keyword list, then you won’t get a lot of the sites that are using that framework or CMS. So to get the best target list possible, I will do a run with no keywords (just the footprint) and combine those results with the run I did with my large word list.
When using a keyword or word list, the trick is to use words like “computers, electronics, women, man, sports, college, education, science”… You get what I mean.. General keywords.. Broad keywords.. Words that will output many more results than very specific keywords. Another trick is to use a “common word list”. You can find these all over the web. The only problem with these is that many are words that aren’t related or used with your target framework or CMS, and/or they are in multiple languages, or they are in alphabetical order and have very similar words (in other words, you don’t want to use a dictionary).
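Here’s an illustrative filter (my own heuristic, not a Scrapebox feature) for cleaning a raw “common word list” along those lines: keep ASCII-alphabetic words only, and keep at most one word per 4-letter stem so alphabetically adjacent, near-identical words don’t flood the list with queries that return the same sites:

```python
# Word-list cleaner: broad, dissimilar, English-looking words only.

def clean_word_list(words, stem_len=4):
    seen_stems = set()
    kept = []
    for w in (word.strip().lower() for word in words):
        if not w.isascii() or not w.isalpha():
            continue  # drop numbers, punctuation, non-English scripts
        stem = w[:stem_len]
        if stem in seen_stems:
            continue  # too similar to a word we already kept
        seen_stems.add(stem)
        kept.append(w)
    return kept

print(clean_word_list(["computers", "computing", "sports", "niño", "women"]))
# → ['computers', 'sports', 'women']
```

The 4-letter stem is a crude stand-in for “similar word” – tune it, or swap in a real stemmer, depending on how aggressive you want the pruning to be.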
Here’s a list of words (mostly nouns) that you can use: wordlist1_nouns.txt.
Remember – that word list is around 1K words – that’s a LOT of search queries.. Even when using hundreds of proxies, many will get banned. This is one reason I don’t do multiple footprints at once… You’ll start getting bad or no results
due to the proxies being banned. Even when doing one footprint at a time, with a large word list, you can still get banned – especially if you are running one footprint after another. So you might have to harvest and test new proxies in between runs, or unlock the proxies using Scrapebox and Decaptcher. Another thing you can try is to make the list smaller. Cut the list into 3 separate lists and do 3 different runs.
So, to recap:
- Load at least 3 websites in your browser that you know are using the CMS or framework you are targeting.
- View the source code to find similar phrases or common aspects of the sites in question, and make sure they are in all the sites you are looking at. Also, make sure they aren’t too general (if it’s too general, you’ll get a lot of unrelated sites).
- Test the footprint (query) in Google.com and make sure it’s providing the targets you want.
- Load up Scrapebox and add the footprint/query into the program (make sure you have loaded good/tested proxies).
- Don’t import any keywords and start harvesting your list.
- Remove duplicates (depending on what you are using this list for – you’ll probably want to remove duplicate DOMAINS).
- Import your word list into Scrapebox and re-harvest your list.
- Remove the duplicates again. That’s it!
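That final “remove duplicate DOMAINS” step can be sketched in a few lines of Python (assuming you only want one URL per site, e.g. for blog commenting; the harvest list is a placeholder):

```python
# Keep the first URL seen for each hostname, drop the rest.
from urllib.parse import urlparse

def dedupe_by_domain(urls):
    seen = set()
    unique = []
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host and host not in seen:
            seen.add(host)
            unique.append(url)
    return unique

harvest = [
    "http://blog.example.com/post-1",
    "http://blog.example.com/post-2",   # same domain – dropped
    "http://other.example.org/entry",
]
print(dedupe_by_domain(harvest))
# → ['http://blog.example.com/post-1', 'http://other.example.org/entry']
```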
I hope this has been helpful. As always, let me know if you have any questions and… Have Phun!