Filtering Spam with SpamAssassin, SpamProbe, Fetchmail, and Gnus.


My mail spools on sanpietro. I use fetchmail to download to the mail spool on my local machine. Then I run Gnus in Emacs, which grabs my mail from the local spool and puts it in front of my eyes.

Spam filtering happens very early in this process. Both SpamAssassin and SpamProbe run on sanpietro, where procmail pipes each incoming message through them before it reaches the mail spool. They evaluate each message and add headers indicating its spam status. Based on those headers, procmail either sends the mail to /dev/null or lets it continue on to the spool with the new headers attached. Your local client can examine the spam headers of the mails that get through and make more sophisticated decisions than procmail did, if you want.


SpamAssassin's headers look like this:

   X-Spam-Status: Yes|No ...

SpamProbe's headers look like this:

   X-SpamProbe: SPAM|GOOD ...
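
As a rough illustration of how a local client or script could act on these headers, here is a minimal shell sketch. The function name spam_verdict is made up for this example, and it assumes the headers appear exactly as shown above (same capitalization, one per line, not folded):

```shell
# Print "spam" if either filter flagged the message on stdin, else "ham".
# Only the header block (everything up to the first blank line) is examined,
# so header-like text quoted in a message body cannot trigger a match.
spam_verdict() {
    if sed '/^$/q' | grep -Eq '^(X-Spam-Status: Yes|X-SpamProbe: SPAM)'; then
        echo spam
    else
        echo ham
    fi
}
```

You would feed it a single message, for example: spam_verdict < some-message.txt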

Before you set up your ~/.procmailrc file, you need to train SpamProbe to recognize the difference between spam and ham. SpamProbe's big win is that it adapts to the spams/hams you in particular tend to receive... But only if you train it first. If you happen to have a Unix mbox file containing thousands of spams sitting around, and another one with thousands of hams, you're ready to go. If not, please feel free to use mine:

   sp$ mkdir ~/.spamprobe
   sp$ mkdir ~/ham-spam
   sp$ cd ~/ham-spam
   sp$ cp ~kfogel/ham-spam/spam*.gz .
   sp$ cp ~kfogel/ham-spam/ham*.gz .
   sp$ gunzip *.gz
   sp$ spamprobe spam spam*
   sp$ spamprobe good ham*
   sp$ ls -lh ~/.spamprobe
   total 73M
   -rw-------    1 jrandom  users           0 Apr 28 11:54 lock
   -rw-------    1 jrandom  users         73M Apr 29 17:02 sp_words

(SpamAssassin doesn't need any training — it just looks for common markers of spam. It generally makes conservative decisions; I don't think I've ever had a false positive with the default settings. But of course, it lets through more spam than SpamProbe does.)

Now that you've trained SpamProbe, you're ready to set up your ~/.procmailrc:

   ## Route all mail through SpamAssassin first.
   :0 fw
   | /var/spamd/bin/spamc

   ## Put SpamAssassin-matched spam into the spiral file.
   :0
   * ^X-Spam-Status: Yes
   /dev/null

   ## The next line of defense is SpamProbe.
   :0
   SCORE=| /usr/bin/spamprobe receive

   :0 fw
   | formail -I "X-SpamProbe: $SCORE"

   ## Put SpamProbe-matched spam into the spiral file.
   :0 a:
   * ^X-SpamProbe: SPAM
   /dev/null

But wait, there's more...

Spams will still get through. In fact, the number that get through will go up over time, as the nature of spam slowly changes (last month it was breast enlargement, this month it's cheap meds) and drifts out of the range of what SpamProbe can detect based on the initial training session. So as spams get through, you need to save them, and every month or so upload them to sanpietro and run them through spamprobe again. The same is true for the hams: the nature of your good mails changes over time too. SpamProbe needs to know what both kinds look like, in order to reliably distinguish ham from spam.

Refreshing SpamProbe is easy:

   sp$ spamprobe spam my-spams.mbox
   sp$ spamprobe good my-hams.mbox
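
If you want to wrap those two commands in something you can run without thinking, a sketch along these lines might do. The function names mbox_count and refresh_spamprobe are invented for this example; the only SpamProbe commands used are the two shown above. The message count is approximate, since it just counts lines beginning with "From ":

```shell
# Count messages in an mbox file by counting "From " separator lines.
mbox_count() {
    grep -c '^From ' "$1"
}

# Train SpamProbe on a spam mbox ($1) and a ham mbox ($2).
# A file that is missing or empty is silently skipped.
refresh_spamprobe() {
    if [ -s "$1" ]; then
        echo "training on $(mbox_count "$1") spams from $1"
        spamprobe spam "$1"
    fi
    if [ -s "$2" ]; then
        echo "training on $(mbox_count "$2") hams from $2"
        spamprobe good "$2"
    fi
}
```

Usage would then be: refresh_spamprobe my-spams.mbox my-hams.mbox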

What's harder is actually having the mails to refresh it with. You'll probably want a single, convenient keystroke in your mailreader to save something in the spamfile or hamfile. Here's what I use in Gnus (your mileage may vary, of course):

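   (defun my-gnus-save-luncheon-meat (type)
     "Save current message as TYPE ('spam' or 'ham') for training SpamProbe."
     (let* ((file (cond
                   ((eq type 'spam) (expand-file-name "~/spam"))
                   ((eq type 'ham)  (expand-file-name "~/ham"))
                   (t               (error "Unrecognized type '%S'" type))))
            ;; Save in Unix mbox format, always to FILE, without prompting.
            ;; Assumption: `gnus-default-article-saver' and
            ;; `gnus-mail-save-name' are the variables intended here.
            (gnus-default-article-saver 'gnus-summary-save-in-mail)
            (gnus-mail-save-name
             `(lambda (newsgroup headers &optional last-file) ,file))
            (gnus-prompt-before-saving nil)
            (gnus-expert-user t))
       (gnus-summary-save-article)
       ;; Spam gets marked expirable; ham is left alone and we just move on.
       (if (eq type 'spam)
           (gnus-summary-mark-as-expirable 1)
         (next-line 1))))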
   (defun my-gnus-save-spam ()
     "Save this message as spam, for training SpamProbe."
     (my-gnus-save-luncheon-meat 'spam))
   (defun my-gnus-save-ham ()
     "Save this message as ham, for training SpamProbe."
     (my-gnus-save-luncheon-meat 'ham))
   (defun my-gnus-summary-mode-hook ()
     ;; Save spams and hams for training SpamProbe.
     (local-set-key "s" 'my-gnus-save-spam)
     (local-set-key "H" 'my-gnus-save-ham))
   (add-hook 'gnus-summary-mode-hook 'my-gnus-summary-mode-hook)

Now when you hit s in a Gnus summary buffer, it saves the message in the ~/spam mbox file, then marks it as expirable in Gnus, all in one keystroke. If you hit H, it saves the message in ~/ham but does nothing else, since you might not want to delete or expire a ham mail. Every so often, you should upload ~/spam and ~/ham to sanpietro and run the appropriate SpamProbe commands. You don't necessarily have to empty the mbox files after each refresh (though you might want to, to reduce your upload times to sanpietro). SpamProbe remembers a checksum of every message it has ever evaluated, so it won't accidentally count a message twice, even if the same message is encountered in multiple refresh sessions.

If you want to be cautious and see how SpamAssassin or SpamProbe is doing, just comment out the relevant sections in your ~/.procmailrc (the parts about putting stuff in /dev/null), and add rules to your Gnus 'nnmail-split-methods' to put the alleged spam in one place, where you can inspect it for false positives:

   (setq nnmail-split-methods
         '(
           ;; Spam for breakfast, spam for lunch, spam for dinner.
           ;; ("spam" is a placeholder group name; use whatever you like.)
           ("spam" "^X-Spam-Status: Yes")
           ("spam" "^X-SpamProbe: SPAM")

           ;; the rest of your nnmail split stuff goes below here ;;
           ))

That's it. Questions, comments to kfogel.

(Back to Karl Fogel's home page.)