Spam Filtering with Bogofilter, Procmail and EXMH

Introduction

Everybody uses e-mail, but spam can turn it from the useful tool it was meant to be into a bane. This document is a how-to for setting up spam filtering (and other basic filtering) with Bogofilter, Procmail and EXMH.

Seven years ago, I wrote ifile, a general-purpose mail filtering tool. For some time, I've used it to organize my mail. My inbox is my "to-do" list, so whenever I'm done with an e-mail (or never wanted it in the first place), I let ifile filter it by pressing the `i' key. This worked fine for a long time: the extra 10-20 presses of `i' a day for spam weren't a major pain. Well, now I'm getting so much spam that dealing with e-mail has become a real chore. So, I've decided to start using a spam filter. ifile wasn't designed for spam (its lexing is far from optimal), so I looked elsewhere. After playing with a few command-line filters, I've settled on bogofilter. Building and installing it is easy. The trick is getting it to work with your environment.

Training Bogofilter

One thing you absolutely must have in order to use bogofilter (or most any good spam filter for that matter) is a corpus of spam e-mails similar to those you normally receive. Just create a "spam" folder and put all your spam there. Or, get some from one of your friends. Once you get a few hundred, you should have enough for bogofilter to be effective. It's good to keep your spam collection current. I archive my old mail with a script, mailArchive.perl.

My first step was to train bogofilter. The script bogoTrain.perl grabs your MH information to determine your mail directory, list of folders, etc.

It takes a minute or two to train on my 10,000 or so messages. Let me know if you have trouble getting bogoTrain.perl to work.
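
The same idea can be sketched in shell. This is a minimal sketch, not bogoTrain.perl itself: it assumes that your MH folders are the top-level directories under ~/Mail, that the spam folder is named "spam", and that every other folder holds ham. The names MAIL_DIR, SPAM_FOLDER, BOGOFILTER, train_folder and train_all are made up for this example; -s and -n are bogofilter's options for registering a message as spam or as non-spam.

```shell
#!/bin/sh
# Sketch only: register MH messages with bogofilter as spam or ham.
# MAIL_DIR, SPAM_FOLDER and BOGOFILTER are illustrative names (assumptions).
MAIL_DIR="${MAIL_DIR:-$HOME/Mail}"
SPAM_FOLDER="${SPAM_FOLDER:-spam}"
BOGOFILTER="${BOGOFILTER:-bogofilter}"

# train_folder DIR FLAG
# Feed every message in DIR to bogofilter with FLAG (-s = spam, -n = ham).
train_folder() {
    for msg in "$1"/*; do
        # MH stores each message in a file whose name is all digits;
        # skip everything else (subfolders, backups such as ",12", ...).
        case "${msg##*/}" in
            ''|*[!0-9]*) continue ;;
        esac
        if [ -f "$msg" ]; then
            "$BOGOFILTER" "$2" < "$msg"
        fi
    done
}

# train_all
# Treat SPAM_FOLDER as spam and every other top-level folder as ham.
train_all() {
    for dir in "$MAIL_DIR"/*/; do
        [ -d "$dir" ] || continue
        dir="${dir%/}"
        if [ "${dir##*/}" = "$SPAM_FOLDER" ]; then
            train_folder "$dir" -s
        else
            train_folder "$dir" -n
        fi
    done
}
```

Source the file and run train_all. It adds to whatever database already exists, so start from an empty one if you want a clean rebuild.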

With bogofilter trained, you can test to make sure it learned something by running

      bogofilter -T < ~/Mail/foo/num
    
where "foo" is a mailbox and "num" is the number of a message. It will print 'H' for ham (non-spam), 'S' for spam and 'U' if it is unsure.
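
For scripting, bogofilter's exit status is often more convenient than its printed output: the man page documents 0 for spam, 1 for non-spam, 2 for unsure and 3 for errors. The sketch below turns that status into a word. The function name classify and the BOGOFILTER variable are made up for this example.

```shell
#!/bin/sh
# Sketch only: BOGOFILTER and classify are illustrative names (assumptions).
BOGOFILTER="${BOGOFILTER:-bogofilter}"

# classify FILE
# Print spam, ham, unsure or error for the message stored in FILE.
classify() {
    # Check readability first; otherwise a failed redirection would
    # produce a non-zero status that could be mistaken for a verdict.
    if [ ! -r "$1" ]; then
        echo error
        return 1
    fi
    "$BOGOFILTER" < "$1" > /dev/null 2>&1
    case $? in
        0) echo spam ;;
        1) echo ham ;;
        2) echo unsure ;;
        *) echo error ;;
    esac
}
```

For example, `classify ~/Mail/foo/num` prints one word for that message.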

Procmail

The next step is to set up procmail to process your mail. It's easiest to have EXMH handle this. There are many benefits over filtering more directly (e.g. via a .forward file): filtering is always done (1) under your user id, (2) using your environment, and (3) on your machine. So, you don't need to tell procmail basic things like your HOME directory and you don't need to ask the sysadmins to install special software (like bogofilter) on the mail server. Anyway, start with a basic procmail configuration file that logs what it does and delivers everything to your inbox.

Save this as ~/.procmailrc. Tell EXMH to use procmail by setting the "Incorporate Mail"->"Ways to Inc" preference to "presort" and the "Incorporate Mail"->"Method used to filter incoming mail" preference to "procmail". Once you've incorporated an e-mail with procmail hooked up, you'll be able to see logging information in ~/procmail/log. To turn off the per-message summaries (and log only errors), set LOGABSTRACT=no.
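
A starting configuration of the kind described here can be written out with a short shell sketch. The contents are an assumption assembled from this section (MH-style delivery into Mail/inbox and a log at ~/procmail/log), so compare them with your own setup before using them. The function name write_starter_procmailrc is made up for this example, and it refuses to overwrite an existing file.

```shell
#!/bin/sh
# write_starter_procmailrc TARGET
# Write a minimal procmail configuration to TARGET (sketch, see assumptions).
write_starter_procmailrc() {
    if [ -e "$1" ]; then
        echo "refusing to overwrite $1" >&2
        return 1
    fi
    # The quoted 'EOF' keeps $HOME literal so that procmail expands it itself.
    cat > "$1" <<'EOF'
# Mail folders live under ~/Mail (assumption: the usual MH location).
MAILDIR=$HOME/Mail
# The directory ~/procmail must already exist for logging to work.
LOGFILE=$HOME/procmail/log

# Default recipe: no conditions, so it always matches.
# The trailing "/." tells procmail that inbox is an MH folder.
:0
inbox/.
EOF
}
```

Calling `write_starter_procmailrc "$HOME/.procmailrc"` creates the file if it does not exist yet.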

Now all of your e-mail should be going to your inbox. Next we want to filter the easy stuff like mailing lists and whatnot. For this we'll write some procmail recipes to go before the default inbox recipe (procmail processes the recipes in order until one matches and delivers the message, and then it stops). There are other good documents that deal with this, such as Using the procmail program and Using Procmail at Monash University. I'll just give this a brief treatment.

I like anything addressed to me (other than spam) to be delivered to my inbox. This helps me follow mailing list discussions that I am involved in. The recipe for this is

      :0
      * ^TOmy@email.address.com
      inbox/.
    
"^TO" is a special token that checks To:, Cc: and other relevant headers. Substitute my@email.address.com with your own e-mail address. Adding this to your .procmailrc gives: procmailrc.

Most everyone subscribes to a mailing list or two or ten. These are easy to filter because the mailing list address is almost always in the To: or Cc: header. If mailinglist@mailinglists.com is a mailing list you receive and you want such mail filtered to your "mailinglist" folder, the recipe is:

      :0
      * ^TOmailinglist@mailinglists.com
      mailinglist/.
    
Replace "mailinglist@mailinglists.com" with the mailing list e-mail address and replace "mailinglist" with the folder that you want messages filtered to. Adding this to your .procmailrc gives: procmailrc.

Spam Filtering

Okay, enough basic procmail stuff, time for the real fun: getting procmail and bogofilter to work together to filter out spam. The recipe in the bogofilter man page is close to what we need. But it assumes that you're using mbox-style folders. The revised recipe is:

      :0HB:
      * ? /usr/bin/bogofilter
      spam/.
    
Change the path if bogofilter is installed elsewhere. If bogofilter thinks the message is spam, it will be put into Mail/spam. Here's the final procmailrc.

The End

That's it! You can either have bogofilter update its database for every e-mail (add the -u option to the above recipe) or periodically run bogoTrain.perl. I use the second option because it eliminates the (small) possibility of drift due to misclassifications. bogoTrain.perl does not overwrite your current bogofilter database until the new one is fully created (during learning, the bogofilter database is stored in $tmpDataDir).
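
The build-then-swap idea can be sketched in shell as follows. This is not bogoTrain.perl's code: BOGO_DIR, rebuild_db and the training command passed to it are made-up names, and the sketch assumes the database lives in bogofilter's default directory, ~/.bogofilter. The training command is expected to write a fresh database into the directory it is given, for example by running bogofilter with -d pointing at that directory.

```shell
#!/bin/sh
# Sketch only: BOGO_DIR and rebuild_db are illustrative names (assumptions).
BOGO_DIR="${BOGO_DIR:-$HOME/.bogofilter}"

# rebuild_db TRAIN_CMD
# Run TRAIN_CMD with a scratch directory as its only argument. Replace
# BOGO_DIR with the scratch directory only if TRAIN_CMD succeeds.
rebuild_db() {
    new_dir="$BOGO_DIR.new.$$"
    old_dir="$BOGO_DIR.old.$$"
    rm -rf "$new_dir"
    mkdir -p "$new_dir" || return 1
    if ! "$1" "$new_dir"; then
        # Training failed: discard the scratch copy, keep the live database.
        rm -rf "$new_dir"
        return 1
    fi
    if [ -d "$BOGO_DIR" ]; then
        mv "$BOGO_DIR" "$old_dir" || return 1
    fi
    # Put the new database in place, then drop the previous one.
    mv "$new_dir" "$BOGO_DIR" && rm -rf "$old_dir"
}
```

There is a brief moment between the two mv calls when no database is in place, so run this when mail is not being incorporated.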



Last modified: Fri Sep 17 13:35:17 EDT 2004