Who spiders my site? A list of bots I've encountered
In vaguely alphabetical order, not exhaustive, updated as and when I feel like it
- Cazoodle, their robot seems to be misbehaving when you first see it, but on further inspection it seems it's actually crawling your site spectacularly slowly- it does load the robots.txt file and your front page, and then loads other pages over the following days. So if you missed it the first time, it looks as though it's ignoring your robots.txt. Unfortunately this behaviour does seem to mean that if you change the robots.txt after it first turns up, it won't see the changes :P The claim that the robot and site are run by the University of Illinois seems to be borne out by the WHOIS data for their domain, so I'm reasonably satisfied they're not spammers or content rippers.
- "DotBot", from DotNetDotCom.org- reads the robots.txt, and the description there sounds vaguely legit, but I'm a bit put out that they spidered half the same pages of my site again 4h after the first pass :P
- DiscoBot - not looked into yet; it visited on 29th December 08
- Googlebot: You may have heard of this one!! Spiders for a little-known web-search startup known as "Google"... Obviously there's a lot to be said about this, so it warrants its own section.
- MJ12bot: a bit confusing. It seems to be some sort of volunteer project where individual users do the spidering, but that's not very clear. I've been visited by bots claiming to be MJ12bot a few times; one time one noticeably fetched robots.txt, then cheerfully ignored it, spidering numerous disallowed pages along with the others. Their site has a bit of info about fraudulent MJ12bots but I didn't find it very helpful- it was mostly along the lines that X and Y aren't them, not how you can tell for sure whether Z is.
- WebDataCentre: This has visited a few times now, forget how exactly the robot introduces itself, but it's either written by illegal content-scrapers or by incompetents. The page it links by way of explanation is a vacuous sentence or two of hand-waving. Either way, it doesn't bother with robots.txt, and just downloads whatever it likes all within a very short time. There isn't one single IP to ban, it uses a different IP each visit (but each visit consists of a ton of hits from that one IP). There's a list of some of its IPs at this page and more info at this blog entry and this forum thread. Another page I couldn't access when I found it is here. An "obvious" method to get such rogue spiders is to automatically trap them based on lots of hits in a short time, without getting robots.txt, but this would catch any ordinary users whose browsers session-restore several pages from your site on startup, and bingo you've promptly excluded a user who happened to like your site a lot. It also has a "shutting the stable door after the horse has bolted" problem, by the time you block the spider it may have scraped half your site anyway.
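For illustration, the "obvious" trap described in the WebDataCentre entry could be sketched roughly as below. This is only a sketch under assumptions: the class name `SpiderTrap`, the thresholds and the example IPs are all made up here, and it inherits both problems just mentioned (a session-restoring browser can trip it, and the scraper has already taken a pile of pages by the time it does).

```python
from collections import defaultdict, deque


class SpiderTrap:
    """Flag an IP that makes many hits in a short window without ever
    requesting /robots.txt. Hypothetical helper, not a tested defence."""

    def __init__(self, max_hits=20, window_seconds=10.0):
        self.max_hits = max_hits              # assumed threshold
        self.window_seconds = window_seconds  # assumed window length
        self._hits = defaultdict(deque)       # ip -> recent hit timestamps
        self._fetched_robots = set()          # ips that asked for robots.txt

    def record(self, ip, path, timestamp):
        """Record one request; return True if ip now looks like a rogue spider."""
        if path == "/robots.txt":
            self._fetched_robots.add(ip)
        hits = self._hits[ip]
        hits.append(timestamp)
        # Forget hits that have fallen out of the sliding window.
        while hits and timestamp - hits[0] > self.window_seconds:
            hits.popleft()
        return len(hits) > self.max_hits and ip not in self._fetched_robots


if __name__ == "__main__":
    trap = SpiderTrap(max_hits=5, window_seconds=10.0)
    for second in range(6):
        flagged = trap.record("203.0.113.7", f"/page{second}.html", float(second))
    print("flagged:", flagged)
```

Note that `record` only reports a suspicion; what to do with it (ban, throttle, just log it) is a separate decision, and a bot that fetches robots.txt and then ignores it sails straight through.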
What can you do about teh naughty bots?
Ummm... pass. If they have a consistent IP address you can AFAICT ban that, maybe via .htaccess. If they have a distinctive UA identification, you can presumably ban that too, again maybe via .htaccess. Ever tried reading the docs for .htaccess? There's loads of it, and unfortunately it seems to be divided into: reference docs for people who grok the overall structure somehow, and tutorials for setting up http auth, the complete botch of an access control mechanism that NOBODY should be using (there are far better methods if you have server-side scripting of some sort).
Even after finding a handy cheat-sheet thing for doing redirects and URL rewriting with .htaccess, which I use quite a lot, and despite being perfectly familiar with regexps and peculiar file formats and programming languages, I still find even that part of the .htaccess system spectacularly underexplained. So if you'd like to know how to block clients via .htaccess files, well that makes two of us.
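For what it's worth, the commonly quoted recipe for banning by IP or by User-Agent looks roughly like the output of the sketch below. Treat it as an assumption-laden starting point rather than a known-good answer: it assumes Apache 2.2-style access directives (`Order`, `Allow`, `Deny` from mod_authz_host, plus `SetEnvIfNoCase` from mod_setenvif) and that the host allows them in .htaccess. The function name, the `bad_bot` variable name and the example IP are invented for illustration.

```python
import ipaddress
import re


def htaccess_block_rules(ips=(), user_agents=()):
    """Build .htaccess text that denies the given IPs and User-Agent
    substrings. Hypothetical helper; assumes Apache 2.2-style directives."""
    lines = []
    for agent in user_agents:
        # Keep to plain tokens so nothing needs regex or quote escaping.
        if not re.fullmatch(r"[A-Za-z0-9_-]+", agent):
            raise ValueError(f"unsupported User-Agent token: {agent!r}")
        # Mark any request whose User-Agent contains the token.
        lines.append(f'SetEnvIfNoCase User-Agent "{agent}" bad_bot')
    # Allow everyone first, then let the Deny lines override.
    lines.append("Order Allow,Deny")
    lines.append("Allow from all")
    for ip in ips:
        ipaddress.ip_address(ip)  # raises ValueError on a malformed address
        lines.append(f"Deny from {ip}")
    if user_agents:
        lines.append("Deny from env=bad_bot")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(htaccess_block_rules(["192.0.2.10"], ["WebDataCentre"]), end="")
```

A User-Agent ban only works while the bot keeps announcing itself honestly, and an IP ban is of little use against something like WebDataCentre that turns up from a different address each visit.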
Of course there are some bots that you wouldn't want to ban outright however naughty they are...
Googlebot
You'd think a big company like Google that purports to be all about openness and interoperability and open standards and "Don't Be Evil" would have a robot that behaves reasonably, wouldn't you? Unfortunately that robot can be, uh, let's say a bit awkward. And it goes without saying I don't want to exclude my site from Google completely, I'm not a complete idiot.
Permanent redirects
It reads "Moved permanently" redirect results for pages as "will probably be back next week", and will try again, and again, and again, for many months at least, even if there are no pages linking to the resource. I know this because I redirected an old URL and fixed all its links back in May when the site had basically no outside links, just on the vague off-chance someone had found the page in search and bookmarked it. Still it checks those pages, no matter how many times it's given a 301. For the most part this is just a nuisance, but it seems it even keeps some of the old redirected URLs in search results, which frankly defeats the whole damn point of permanent redirects. In the end I recently (December) removed that redirect so those requests give a 404 (or 403 for the directory index), to see if that stops it, because it just bugs me. Time will tell.
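One way to keep an eye on this sort of thing is to count, per URL, how often something calling itself Googlebot gets served a 301. The sketch below assumes the server writes Apache "combined" format access logs; the regex, the function name and the sample log lines (including the documentation-range IPs and paths) are all made up for illustration.

```python
import re
from collections import Counter

# Rough pattern for Apache "combined" log lines (an assumption about the
# log format; adjust it to whatever your server actually writes).
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<when>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
    r'(?: "(?P<referer>[^"]*)" "(?P<agent>[^"]*)")?'
)

GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

# Hypothetical sample data, not taken from any real log.
SAMPLE_LOG = [
    '192.0.2.1 - - [10/Dec/2008:06:25:24 +0000] "GET /old/page.html HTTP/1.1" 301 235 "-" "' + GOOGLEBOT_UA + '"',
    '192.0.2.1 - - [12/Dec/2008:07:01:02 +0000] "GET /old/page.html HTTP/1.1" 301 235 "-" "' + GOOGLEBOT_UA + '"',
    '192.0.2.1 - - [12/Dec/2008:07:01:09 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "' + GOOGLEBOT_UA + '"',
    '198.51.100.9 - - [12/Dec/2008:09:15:00 +0000] "GET /old/page.html HTTP/1.1" 301 235 "-" "Mozilla/5.0 Firefox/3.0"',
]


def googlebot_redirect_hits(lines):
    """Count 301 responses per path for requests whose UA mentions Googlebot.

    The UA string can be forged, so this counts claimed Googlebot hits only.
    """
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that don't look like the assumed format
        agent = match.group("agent") or ""
        if match.group("status") == "301" and "googlebot" in agent.lower():
            hits[match.group("path")] += 1
    return hits


if __name__ == "__main__":
    for path, count in googlebot_redirect_hits(SAMPLE_LOG).most_common():
        print(count, path)
```

Running it over a few months of logs would show whether the repeat visits to redirected URLs ever tail off.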
Robots.txt
The robots.txt thing is a bit of an oddity. Everybody "knows" that the robots.txt file will stop honest robots/spiders from accessing the pages described, and should be effective in stopping them listing the pages too. Obviously it has no effect on robots that choose to ignore it, or worse still use it as a list of pages to check for "interesting" stuff. Well, Googlebot, like most legit robots, does check it, and does indeed stop spidering the pages you tell it to stay off. Problem solved, right? Mmmmno. If in fact Googlebot had already accessed the pages in question, and you only thought to block it afterwards, you will find that adding the pages to robots.txt only prevents it checking the pages again. It WILL NOT remove them from the Google Index, even if they claim otherwise (as of Dec 2008 at least). If they were listed before you changed your robots.txt, try another site:yoursitename.com type search and watch with joy as those pages are still given in results 4 or 5 months later, despite Googlebot checking your site every day or two.
What it seems you must do in those circumstances, is first put a META ROBOTS tag in the pages you want google to leave alone, and THEN only after Googlebot has seen those pages, add them to the robots.txt file (Note that this only applies if Googlebot's already seen them without the meta robots tag, it's not necessary if it's never spidered the page in the first place! To reiterate, robots.txt does stop it spidering the pages). The Google Help files do say that it seeing the meta robots noindex instruction on a page causes the page to be removed from the index. Of course, they also seem to say the same about the robots.txt file too... Again, time will tell if this kludgey fix actually gets the job done- it could well be that I'd put the meta tags in before the robots.txt file originally after all and just forgot the chronology, in which case clearly they're just doing whatever the hell they like because they can.
Note also, Google's interpretation of the robots.txt, according to this help page, seems to be based more on the later IETF draft RFC, which as best I can tell was never actually accepted as a standard ("Internet-Drafts are draft documents valid for a maximum of six months"), but it's not as though the original version was very official to begin with. Not necessarily a bad thing, even if most robots seem to assume the original version, but probably something to be aware of anyway.
Your META ROBOTS tag should look a bit like
<META NAME="ROBOTS" CONTENT="NOINDEX">
or
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
or variants on that, in order to have search engines exclude the page from their indices (set of possible search results). It must go in the <head> section of the page, like the title tag does. The "Nofollow" option is to do with whether linked pages are considered suitable for spidering too, but is of limited use if other pages linking to them don't have it. It's not case-sensitive FWIW. For more info see the META tags page at Robotstxt.org.
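If you want to double-check that a page really does carry the tag, something like the following sketch would do it, using only Python's standard `html.parser`. The class and function names are invented here, and it is deliberately simple: it treats the name and values case-insensitively but does not check that the tag sits inside the <head> section.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us,
        # but attribute values keep their original case.
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html_text):
    """Return the set of lowercased robots directives found in html_text."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    parser.close()
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
    print(sorted(robots_directives(page)))
```

A page is marked for exclusion when `"noindex"` is in the returned set; whether a given search engine then honours it is another matter.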
Meanwhile of course, whilst you're waiting for Googlebot to get a clue, you still need it to hit the page in question again, even if that causes little annoyances like spurious session files. Presumably though very few apps/sites have more serious side-effects than that (at least if they're well-designed!). Further, there could conceivably still be other robots, who may obey the robots.txt but be oblivious to the meta robots tag. So it may be wise to have robots.txt exclude everything but Googlebot from those pages, just in case. Hooray! More work.
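The "everything but Googlebot" arrangement can be sanity-checked offline before uploading it. The sketch below feeds a candidate robots.txt to Python's standard `urllib.robotparser` and asks who may fetch what. The `/private/` path and the helper name are placeholders, and the result only shows how Python's parser reads the file; real robots, Googlebot included, may interpret it differently.

```python
from urllib.robotparser import RobotFileParser

# Candidate robots.txt: Googlebot may fetch everything (so it can see the
# meta robots tag), every other robot is kept out of /private/.
# The /private/ path is a placeholder for whichever pages you mean.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /private/
"""


def build_parser(robots_text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    for agent in ("Googlebot", "DotBot"):
        for path in ("/private/page.html", "/index.html"):
            print(agent, path, parser.can_fetch(agent, path))
```

An empty `Disallow:` line means "nothing is disallowed" for that robot, which is what lets Googlebot back in to read the noindex tag.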
There is another option of course- if you have a Google account, you can log in and request some of your pages be excluded from the index. It would be terribly cynical of me to suggest that maybe Google deliberately made the robot misbehave in order to encourage webmasters to sign up and increase their user-base, I'm sure they'd never do something like that. A more likely answer is that out of the many millions of websites they spider, nobody else has ever spotted that basic problem and reported it.