robot visitors


Who spiders my site? A list of bots I've encountered


In vaguely alphabetical order, not exhaustive, updated as and when I feel like it

What can you do about the naughty bots?


Ummm... pass. If they have a consistent IP address you can, AFAICT, ban that, maybe via .htaccess. If they have a distinctive User-Agent string, you can presumably ban that too, again maybe via .htaccess. Ever tried reading the docs for .htaccess? There are loads of them, and unfortunately they seem to be divided into reference docs for people who somehow already grok the overall structure, and tutorials for setting up HTTP auth, the complete botch of an access control mechanism that NOBODY should be using (there are far better methods if you have server-side scripting of some sort).

Even after finding a handy cheat-sheet thing for doing redirects and URL rewriting with .htaccess, which I use quite a lot, and despite being perfectly familiar with regexps and peculiar file formats and programming languages, I still find even that part of the .htaccess system spectacularly underexplained. So if you'd like to know how to block clients via .htaccess files, well that makes two of us.
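
Since I can't vouch for the .htaccess route, here is a rough sketch of the same idea done in server-side script instead: refuse a request if it comes from a banned address range or carries a banned User-Agent string. Everything named here is made up for illustration (the is_blocked function, the "examplebadbot" string, and the 192.0.2.0/24 range, which is reserved for documentation), so treat it as a starting point under those assumptions rather than a tested recipe.

```python
import ipaddress

# Hypothetical ban lists: 192.0.2.0/24 is a documentation-only range,
# and "examplebadbot" is an invented User-Agent fragment.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
BLOCKED_UA_SUBSTRINGS = ["examplebadbot"]


def is_blocked(remote_addr, user_agent):
    """Return True if the client should be refused (e.g. with a 403)."""
    addr = ipaddress.ip_address(remote_addr)
    # Ban by address: only useful if the bot keeps a consistent IP.
    if any(addr in network for network in BLOCKED_NETWORKS):
        return True
    # Ban by User-Agent: only useful if the bot identifies itself distinctively.
    ua = (user_agent or "").lower()
    return any(fragment in ua for fragment in BLOCKED_UA_SUBSTRINGS)
```

You'd call something like this at the top of whatever script serves your pages and send a 403 when it returns True. A bot that changes address and lies about its User-Agent sails straight past it, of course.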

Of course there are some bots that you wouldn't want to ban outright however naughty they are...

Googlebot


You'd think a big company like Google that purports to be all about openness and interoperability and open standards and "Don't Be Evil" would have a robot that behaves reasonably, wouldn't you? Unfortunately that robot can be, uh, let's say a bit awkward. And it goes without saying that I don't want to exclude my site from Google completely; I'm not a complete idiot.
Permanent redirects
It reads "Moved permanently" redirect results for pages as "will probably be back next week", and will try again, and again, and again, for many months at least, even if there are no pages linking to the resource. I know this because I redirected an old URL and fixed all its links back in May when the site had basically no outside links, just on the vague off chance someone had found the page in search and bookmarked it. Still it checks those pages, no matter how many times it's given a 301. For the most part this is just a nuisance, but it seems it even keeps some of the old redirected URLs in search results, which frankly defeats the whole damn point of permanent redirects. In the end I recently (December) removed that redirect so those requests give a 404 (or 403 for the directory index), to see if that stops it, because it just bugs me. Time will tell.

Robots.txt
The robots.txt thing is a bit of an oddity. Everybody "knows" that the robots.txt file will stop honest robots/spiders from accessing the pages described, and should be effective in stopping them listing the pages too. Obviously it has no effect on robots that choose to ignore it, or worse still use it as a list of pages to check for "interesting" stuff. Well, Googlebot, like most legit robots, does check it, and does indeed stop spidering the pages you tell it to stay off. Problem solved, right? Mmmmno. If in fact Googlebot had already accessed the pages in question, and you only thought to block it afterwards, you will find that adding the pages to robots.txt only prevents it checking the pages again. It WILL NOT remove them from the Google index, even if they claim otherwise (as of Dec 2008 at least). If they were listed before you changed your robots.txt, try another site:yoursitename.com type search and watch with joy as those pages are still given in results 4 or 5 months later, despite Googlebot checking your site every day or two.
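
If you want to check what a rule-abiding robot would make of your robots.txt, Python's standard urllib.robotparser module can read the rules and answer "may this robot fetch this path?". A minimal sketch follows; the /old-stuff/ path and the may_crawl helper are my own inventions for the example. Note that it only tells you about crawling: it says nothing about whether a page that's already in the index will be dropped, which is exactly the problem described above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; "/old-stuff/" is a made-up path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /old-stuff/
"""


def may_crawl(robots_txt, user_agent, path):
    """Return True if a robot that obeys robots.txt may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # A well-behaved robot stays out of the disallowed directory...
    print(may_crawl(ROBOTS_TXT, "Googlebot", "/old-stuff/page.html"))
    # ...but is free to fetch everything else.
    print(may_crawl(ROBOTS_TXT, "Googlebot", "/index.html"))
```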

What it seems you must do in those circumstances is first put a META ROBOTS tag in the pages you want Google to leave alone, and THEN, only after Googlebot has seen those pages, add them to the robots.txt file. (Note that this only applies if Googlebot's already seen them without the meta robots tag; it's not necessary if it's never spidered the page in the first place! To reiterate, robots.txt does stop it spidering the pages.) The Google Help files do say that seeing the meta robots noindex instruction on a page causes the page to be removed from the index. Of course, they also seem to say the same about the robots.txt file... Again, time will tell if this kludgey fix actually gets the job done. It could well be that I'd put the meta tags in before the robots.txt file originally after all and just forgot the chronology, in which case clearly they're just doing whatever the hell they like because they can.

Note also that Google's interpretation of robots.txt, according to this help page, seems to be based more on the later IETF draft RFC, which as best I can tell was never actually accepted as a standard ("Internet-Drafts are draft documents valid for a maximum of six months"), but it's not as though the original version was very official to begin with. Not necessarily a bad thing, even if most robots seem to assume the original version, but probably something to be aware of anyway.

Your META ROBOTS tag should look a bit like
<META NAME="ROBOTS" CONTENT="NOINDEX">
or
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
or variants on that, in order to have search engines exclude the page from their indices (the set of possible search results). It must go in the <head> section of the page, like the title tag does. The "Nofollow" option is to do with whether linked pages are considered suitable for spidering too, but is of limited use if other pages don't have it. It's not case-sensitive, FWIW. For more info see the META tags page at Robotstxt.org.
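
To double-check that a page really is serving the tag, something like the following sketch will pull the directives out of the HTML using Python's standard html.parser module. The RobotsMetaParser class and robots_directives function are names I've made up for the example, and it lower-cases everything on the assumption (stated above) that the tag isn't case-sensitive.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lower-cases tag and attribute names.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of lower-cased robots directives found in the page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head></html>'
    print(sorted(robots_directives(page)))
```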

Meanwhile of course, whilst you're waiting for Googlebot to get a clue, you still need it to hit the page in question again, even if that causes little annoyances like spurious session files. Presumably though very few apps/sites have more serious side-effects than that (at least if they're well-designed!). Further, there could conceivably still be other robots, who may obey the robots.txt but be oblivious to the meta robots tag. So it may be wise to have robots.txt exclude everything but Googlebot from those pages, just in case. Hooray! More work.
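
One way to write the "everything but Googlebot" rule, and to sanity-check it with Python's standard urllib.robotparser before uploading, might look like this. The /private/ path, the SomeOtherBot name and the allowed helper are hypothetical; the empty Disallow line in the Googlebot record means "nothing is off limits" in the original robots.txt convention.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: Googlebot may fetch everything (so it can see the
# meta robots tag), every other robot is kept out of "/private/".
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /private/
"""


def allowed(user_agent, path, robots_txt=ROBOTS_TXT):
    """Return True if the named robot may fetch the path under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    print(allowed("Googlebot", "/private/page.html"))     # Googlebot gets in
    print(allowed("SomeOtherBot", "/private/page.html"))  # others are kept out
```

Once Googlebot has seen the noindex tag, you'd presumably drop the Googlebot record so that the catch-all rule applies to it too.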

There is another option of course: if you have a Google account, you can log in and request that some of your pages be excluded from the index. It would be terribly cynical of me to suggest that maybe Google deliberately made the robot misbehave in order to encourage webmasters to sign up and increase their user-base; I'm sure they'd never do something like that. A more likely answer is that out of the many millions of websites they spider, nobody else has ever spotted that basic problem and reported it.


