I seem to have completely missed the waves of comment spam that have plagued the world of Movable Type in the past months (see Burningbird's recent post on the topic for links to posts of hers that provide a kind of timeline for the attacks). I'm apparently not popular enough to merit an attack, and that's fine with me.
Burningbird has spent a lot of time arguing against MT-Blacklist as spam protection, and I agree with her. For a comparison from the email world, consider SpamAssassin and ifile. SpamAssassin, like MT-Blacklist, evaluates incoming messages primarily against static rules. Bayesian-algorithm filters like ifile and MT-Bayesian, on the other hand, compare incoming messages to other messages that have already been classified as spam or non-spam and calculate the probability that the message is spam. As I understand them, Bayesian filters are faster than complex, regex-filled, rule-based filters because their rules consist only of the number of times a certain token has occurred in spam or in legitimate messages. Those numbers yield a ratio for each token, and the ratios combine into a probability of spammishness. (The seminal article that everyone quotes is Paul Graham's A Plan for Spam; check it out for more details on his algorithm, which has been implemented many different ways by now.)
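To make the token-counting idea concrete, here is a minimal sketch in Python. It is a simplification for illustration, not Paul Graham's exact algorithm and not the code of ifile or MT-Bayesian; the class name, the tokenizer, the neutral 0.5 score for unseen tokens, and the 0.01/0.99 clamps are all assumptions of this sketch.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters, digits, ', $ and -.
    return re.findall(r"[a-z0-9'$-]+", text.lower())


class TokenCounts:
    """Keeps per-token counts for spam and legitimate ("ham") messages."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        tokens = set(tokenize(text))  # count each token once per message
        if is_spam:
            self.spam.update(tokens)
            self.n_spam += 1
        else:
            self.ham.update(tokens)
            self.n_ham += 1

    def token_prob(self, token):
        # Ratio of how often the token shows up in spam vs. ham messages.
        s = self.spam[token] / max(self.n_spam, 1)
        h = self.ham[token] / max(self.n_ham, 1)
        if s + h == 0:
            return 0.5  # never seen: treated as neutral (an assumption)
        # Clamp so that a single token can never be absolutely decisive.
        return min(0.99, max(0.01, s / (s + h)))

    def spam_prob(self, text):
        # Combine the per-token ratios into one overall probability.
        prod_p = 1.0
        prod_q = 1.0
        for token in set(tokenize(text)):
            p = self.token_prob(token)
            prod_p *= p
            prod_q *= 1.0 - p
        return prod_p / (prod_p + prod_q)


counts = TokenCounts()
counts.train("buy cheap viagra now", is_spam=True)
counts.train("cheap pills online", is_spam=True)
counts.train("I enjoyed your post about knitting", is_spam=False)
counts.train("great post, thanks for the links", is_spam=False)
```

Note that there are no regular expressions in the "rules" themselves: scoring a comment is a handful of dictionary lookups and multiplications.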
Bayesian filters work amazingly well. Their only weakness is that they require a fair amount of spam and non-spam to be properly trained, and the more distinct the two vocabularies, the better the filter will work. This is a major weakness when you haven't yet got any spam; see the comments on James Seng's post about MT-Bayesian, where test comments containing nothing but spammish words like viagra are given a 0% spam probability. My blog, had I any spam, would fail the same test.
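One way a result like that 0% can arise is sketched below, assuming a textbook naive Bayes classifier that estimates its class prior from the training data. I don't know that MT-Bayesian works this way; the function name and the add-one smoothing are assumptions of this sketch.

```python
from collections import Counter


def spam_posterior(tokens, spam_msgs, ham_msgs):
    """Naive Bayes posterior with an empirical class prior.

    spam_msgs and ham_msgs are lists of already-tokenized training messages.
    """
    n_spam, n_ham = len(spam_msgs), len(ham_msgs)
    total = n_spam + n_ham
    if total == 0:
        return 0.5  # nothing to learn from at all
    spam_counts = Counter(t for msg in spam_msgs for t in set(msg))
    ham_counts = Counter(t for msg in ham_msgs for t in set(msg))
    # The prior is the share of training messages in each class.
    p_spam = n_spam / total
    p_ham = n_ham / total
    for t in set(tokens):
        # Add-one smoothing keeps unseen tokens from zeroing a class out.
        p_spam *= (spam_counts[t] + 1) / (n_spam + 2)
        p_ham *= (ham_counts[t] + 1) / (n_ham + 2)
    return p_spam / (p_spam + p_ham)


ham = [["nice", "post"], ["thanks", "for", "the", "link"]]

# No spam seen yet: the prior for spam is zero, so nothing can look spammy.
untrained = spam_posterior(["viagra"], [], ham)

# After a single spam example the same comment is scored very differently.
trained = spam_posterior(["viagra"], [["cheap", "viagra"]], ham)
```

With an empty spam corpus the spam prior is zero, so every comment scores 0% no matter what words it contains. A single classified spam is enough to move the score for the same comment above one half.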
Burningbird describes another comment-spam problem that disturbs me more than the actual content: DoS attacks. Spammers can send so many comments and/or pings that the attempt to process them will bring down a server. In this case, filtering schemes are counter-productive because they increase the amount of processing that must be done for each comment or ping. Even though I believe myself safe behind my Bayesian filter, willing to take some blows as I train it, I have a responsibility to Kevin (my ISP) not to make his server vulnerable to a meltdown. But the only thing I can do is to disallow comments and pings, and I would hate that. Even closing comments only on older posts would shut me off from a fair number of legitimate comments. I have disabled the options that email me every time I have a comment or ping, but that may not be enough to withstand a concerted attack.
I hope I stay unpopular enough for this never to become an issue.
Jeff says:
Futility is about right. If you really come under attack, and need to spare Kevin's server, the only option open would likely be battening down the hatches and taking down the comments.
The problem with a DoS attack is that a computer with a big enough pipe can always flood you beyond capacity, spam filter or no. If it doesn't over-tax the server, it will definitely over-tax the server's connection and, given enough time, fill your disk quota, denying other people's comments (and denying the service to others, resulting in a successful Denial of Service attack).
*shrug* Spammers are easily stopped with a filter, but the best treatment for a script kiddie still seems to be a baseball bat. Or, you just have to weather the storm, clean up afterwards and decide whether or not to leave the comment system up.
Jeff
Phil Ringnalda says:
Actually, it's mostly not the connection flooding that hurts, it's the email notification and the page rebuilds. If every comment triggers a rebuild of a few index templates, monthly and category archives, and three individual templates, then when you get a few thousand quick comments, you wind up with too many rebuilds fighting over the same files and resources.
But the best defense is just to install Jacques Distler's patches to add a real throttle to comments and Trackback, and pick a number per day that's bigger than you'll ever get. You still get DoSed, but you decide when, and the only service that's denied is commenting, until the next morning when you wake up and clean out fifty or a hundred junk comments, instead of several thousand.
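The per-day throttle idea can be sketched as follows. This is an illustration in Python, not Jacques Distler's actual patches (which modify Movable Type itself); the class and method names are hypothetical.

```python
from datetime import date


class DailyThrottle:
    """Accepts at most max_per_day comments or pings per calendar day."""

    def __init__(self, max_per_day):
        self.max_per_day = max_per_day
        self.day = None
        self.count = 0

    def allow(self, today=None):
        today = today or date.today()
        if today != self.day:
            # A new day has started: reset the counter.
            self.day = today
            self.count = 0
        if self.count >= self.max_per_day:
            # Over the limit: refuse cheaply, before any filtering,
            # email notification, or page rebuild is attempted.
            return False
        self.count += 1
        return True


throttle = DailyThrottle(max_per_day=3)
first_day = date(2004, 1, 1)
results = [throttle.allow(first_day) for _ in range(5)]
next_morning = throttle.allow(date(2004, 1, 2))
```

The point of the design is that the check comes first and costs almost nothing, so a flood past the limit is rejected without triggering the expensive rebuilds.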
Laurabelle says:
Thanks, Phil. I hadn't known about those patches. As for throttling, I could limit it to 3 per day and never hit my limit. ;-)
Unfortunately I do have a number of different archives (category, individual, daily, monthly, even yearly!), so rebuilds do take a little while. I was very worried about my vulnerability to an attack, because I don't want to take down my friend Kevin's server. I feel much better now.