If you’re interested at all

If you’re interested at all in trying to filter the spam out of your email, you should go read the article “A Plan for Spam” by Paul Graham. The technique he describes is called Bayesian filtering. It works by having the user flag their emails as spam or not-spam. The software then learns how often each token (a token is usually a single word, but can be just about anything, including information in the email headers) appears in the spam versus the non-spam. So if an email contains “lolita” or “ff0000”, the filter can be reasonably sure it’s spam. If it contains both tokens, it can be very sure. The opposite is true as well: tokens that show up mostly in your legitimate mail count as evidence that a message is not spam. So when the software scans an incoming email and combines the evidence from all of its tokens, it ends up with a percentage likelihood that the email is spam, and you can filter based on that.
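To make the idea concrete, here is a rough Python sketch of that kind of filter. It is a simplified illustration, not the code from the article: the `BayesFilter` class, the `tokenize` helper, the neutral 0.5 score for unseen tokens, and the 0.01/0.99 clamping are all my own assumptions, and a real filter would also look at headers and train on far more mail.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters, digits, ' and $.
    # A set is used so each token counts once per message.
    return set(re.findall(r"[a-z0-9'$]+", text.lower()))


class BayesFilter:
    """Toy per-user filter trained on messages flagged spam or not-spam."""

    def __init__(self):
        self.spam_counts = Counter()  # token -> number of spams containing it
        self.ham_counts = Counter()   # token -> number of non-spams containing it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, text, is_spam):
        tokens = tokenize(text)
        if is_spam:
            self.spam_counts.update(tokens)
            self.n_spam += 1
        else:
            self.ham_counts.update(tokens)
            self.n_ham += 1

    def token_prob(self, token):
        s = self.spam_counts[token]
        h = self.ham_counts[token]
        if s + h == 0:
            return 0.5  # assumption: a never-seen token is neutral evidence
        spam_rate = s / max(self.n_spam, 1)
        ham_rate = h / max(self.n_ham, 1)
        p = spam_rate / (spam_rate + ham_rate)
        # Clamp so a single token can never make the result absolutely certain.
        return min(0.99, max(0.01, p))

    def spam_probability(self, text):
        probs = [self.token_prob(t) for t in tokenize(text)]
        # Keep the 15 tokens whose scores are farthest from neutral.
        probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
        prod_spam = 1.0
        prod_ham = 1.0
        for p in probs[:15]:
            prod_spam *= p
            prod_ham *= 1.0 - p
        # Combine the per-token evidence into one overall score.
        return prod_spam / (prod_spam + prod_ham)
```

You would call `train()` once for every message you flag, then compare `spam_probability()` for each new message against a threshold of your choosing (0.9, say) to decide whether it goes to the junk folder.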

Mozilla 1.3b has already implemented Bayesian filtering, and there are free filtering programs available for other mail clients like Outlook as well.

But even if you don’t want to use one of these filters, the article is fascinating to read. It will help you understand a little more about how we can fight spam, and even how you could build your own spam filter using nothing but keyword recognition, a feature every email client I’ve ever used has included (even Pine!).


2 Comments on “If you’re interested at all”

  1. David says:

    Of course Pine includes it, it came out of the UW! Go Huskies!!

    Okay I’m done.

    On a more serious note, I remember seeing something on /. a while ago about a similar project where there would be a database compiled online as people marked their email as spam or not, kept a running tally and then checked your new messages against this database to distinguish the spam from the not-spam. I don’t know if this is the same thing or slightly different than the one you’re talking about.

  2. scott says:

    Not sure which one you’re referring to, but there’s one like that called SpamAssassin. We’re actually running it on fojar. It’s good, but not as good as this. We caught probably 95% of the spam, but we also got some false positives, because it uses the same keyword lists for everyone. The nice thing about the Bayesian method is that everyone defines their own set of spam/non-spam tokens, so their filtering is customized to them. Very nice, and much more accurate, with fewer false positives.