Bots baby

Mon, 3 Mar 2003

When I wrote my last entry about banning nasty user-agents, I tried blocking them with mod_rewrite, but I couldn't make it work. Later Mark Pilgrim wrote a blog entry about using mod_rewrite to block abusive sites and agents, and today I got around to trying out his examples. I'm still not sure what the problem was before, but I'm happy that it's working. It gives me such a thrill to see that pesky DTS Agent get 403'ed. That darn bot retrieves only documents in my root directory, but it does so every single day. The requests come from different hosts, but it's still way too much.
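For illustration, here is a minimal sketch of the kind of rule involved. These are not Mark's actual rules. It assumes mod_rewrite is enabled, that the directives live in an .htaccess file or the server config, and that the bot's User-Agent header contains the string "DTS Agent".

```apache
# Turn on the rewrite engine for this directory or virtual host.
RewriteEngine On

# Assumption: the bot identifies itself with a User-Agent containing "DTS Agent".
# [NC] makes the match case-insensitive.
RewriteCond %{HTTP_USER_AGENT} "DTS Agent" [NC]

# Leave the URL unchanged (-) and answer with 403 Forbidden ([F]).
RewriteRule .* - [F]
```

To block more agents, further RewriteCond lines can be stacked above the RewriteRule, with the [OR] flag on every condition except the last.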

Mark blocks user-agents fairly heavily, and I understand why he does it. I agree with his reasons, and I don't want my site or content abused either. However, I am concerned about the implications of denying access to certain user-agents. For some reason it feels like censorship to me, although that's absurd, because all I'm doing as webmaster is restricting the purposes for which my site can be used. (Hmm, must write a formal site policy.) As the author of my content, I have the right to control its distribution.

But what if I blocked MSIE because I don't like how that browser handles my site? No, of course I wouldn't actually do that, but I suppose it wouldn't be much worse than the many sites that say "Designed for $BROWSER", and there are in fact many sites that deny access to browsers they don't like. I don't know, am I making any sense? Maybe I've absorbed too much of the "information wants to be free" mentality to be entirely comfortable with controlling my own information so closely.

Comments

Kevin says:

The thing with blocking bots is you need to find a balance. If you block blindly, then it's a bit futile, but if you're conservative about it, then it can be a huge help. As for censorship, with address harvesters all you're doing is stopping them from helping themselves to information they're supposed to ask nicely for anyway.

Laurabelle says:

But that's, you know, me saying who can or cannot have access to my information, and as a proto-librarian that's contrary to the culture I've been breathing for the past six months. I think I'll feel better about it when I write a bit of a site policy saying things like "this site is for reading, not harvesting or bulk downloading."

As for email harvesters, there's no point in their asking nicely, because I wouldn't give them addresses anyway. I guess they don't annoy me so much, though I like telling them explicitly that they're not welcome. What really bugs me is this one user-agent coming around all the time, I think from the same hosts, requesting the same two or three files. What the heck are they doing, if they're not crawling my whole site?
