Mar 03 2009
…or annoyances with the DSPAM hash driver
I’ve been running DSPAM for a long time and the spam classification is great. Unfortunately, as it turns out, the maintenance tools are not. This is the tale of why, and more importantly HOW, I moved from the hash driver to the MySQL driver as the backend for my DSPAM installation.
It all began a few days ago when the dovecot-antispam plugin all of a sudden refused to re-classify mails. This was all very weird, because it was possible to re-classify mails from the command line, but the same mail/signature failed from the plugin.
After some debugging and messing with the plugin code I established that the plugin successfully called dspam to re-classify the mail, but dspam itself failed. Hm, maybe a permission problem then…nope. The same command with the same permissions worked from the command line.
At this point I was really puzzled (and quite annoyed); the error made no sense. So I hacked the plugin to call dspam through a system call tracer (in this case truss, which ships with FreeBSD) to produce a call trace of the dspam processes within the context of the plugin.
After comparing the outputs from a working case and a non-working case, one line stood out.
For those who don’t speak truss: this means that the mmap() system call (which maps the contents of a file into memory) failed because of an out-of-memory condition.
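To make the failure mode concrete, here is a minimal Python sketch of mapping a whole file into memory and catching the error when the kernel refuses. It is not how DSPAM does it internally (DSPAM is written in C); the function names `map_file` and `demo` are made up for illustration, and I am assuming the out-of-memory condition surfaces as ENOMEM.

```python
import errno
import mmap
import os
import tempfile


def map_file(path):
    """Map a whole file read-only; return (mapping, None) or (None, errno)."""
    with open(path, "rb") as f:
        try:
            # Length 0 means "map the entire file".
            return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ), None
        except OSError as exc:
            # An out-of-memory refusal would show up here (assumed: ENOMEM).
            return None, exc.errno


def demo():
    """Write a small scratch file, map it, and report the mapped size."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"token data" * 1000)
        path = tmp.name
    try:
        mapping, err = map_file(path)
        if mapping is None:
            return "mmap failed: " + errno.errorcode.get(err, str(err))
        size = len(mapping)
        mapping.close()
        return size
    finally:
        os.remove(path)
```

The point is that the mapping size follows the file size, so a database file that keeps growing will sooner or later run into whatever limit applies to the calling process.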
WHY!? Well, it obviously hit some built-in mmap limit, as the system had a lot of free memory. It was not vm.max_proc_mmap (I tried bumping it; that didn’t help). vm.kmem_size_max is large (64-bit system) and the uid had unlimited resources (ulimit). I’m still puzzled by this.
As the title reveals, I’m running DSPAM with the hash driver. Why? Mostly because it was the default back-end driver in DSPAM when I installed it (and I was young and naive).
As for why it hit the limit in the first place: a quick look in /var/db/dspam/data revealed that the hash database with tokens for this particular user had grown to ~250MB.
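If you want to check your own installation for oversized hash databases, a small sketch like the following lists the largest files under the data directory. The `.css` suffix is an assumption about how the hash driver names its per-user files, and `largest_files` is a made-up helper name; adjust both to your setup.

```python
import os


def largest_files(root, suffix=".css", limit=10):
    """Return up to `limit` (size_in_bytes, path) pairs, largest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(suffix):
                path = os.path.join(dirpath, name)
                try:
                    found.append((os.path.getsize(path), path))
                except OSError:
                    # File vanished or is unreadable; skip it.
                    continue
    found.sort(reverse=True)
    return found[:limit]


if __name__ == "__main__":
    # Path taken from the post; yours may differ.
    for size, path in largest_files("/var/db/dspam/data"):
        print(f"{size / (1024 * 1024):8.1f} MB  {path}")
```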
Why so big? DSPAM comes with a cleaning utility called dspam_clean to clean out unused data, and I’ve been running it nightly to keep stuff neat and tidy. Well, here comes the fun part.
It turns out that dspam_clean doesn’t actually do anything when you’re running the hash driver; it just looks like it’s working and takes ages to execute. There is no mention of this in any user documentation I found. With the hash driver you’re supposed to run cssclean…only this program doesn’t work either, unless you patch it. I found a patch dubbed “cssclean is a redheaded stepchild” sent to the dspam-dev@ list at nuclearelephant.com (now defunct, and moved to SourceForge) that took care of this. The patch missed an include (at least on FreeBSD), which caused cssclean to crash on 64-bit platforms (64/32-bit pointer truncation). The whole patch is included at the bottom of the page.
To summarize, the maintenance tools are virtually non-existent when using the hash driver.
Probably a good opportunity to convert to the MySQL driver instead (as I’ve been meaning to do for the last year or so).
Converting from the hash driver to mysql
There is a tool included with the DSPAM package called dspam_2sql, but of course this didn’t work properly either. Most importantly, it didn’t work with virtual users and just refused to run. It also ignored the timestamps on token data. I’ve hacked together a patch to fix these issues; it’s available at the bottom of the page.
I first attempted to import non-cleaned data into the SQL database…big mistake. I aborted the import when the token table had about 8 million entries (and it wasn’t even done with one user).
Quick recipe on how to do the conversion
Patch and compile the sources
The patched cssclean tools can be found in tools.hash_drv
Run tools.hash_drv/cssclean on all users (or at least the users that receive a lot of mail). Back up your database (/var/db/dspam) first.
You can check the number of entries with cssstat
Create a MySQL database and user for dspam; I’ve called mine dspam. The table definitions can be found in tools.mysql_drv
You should now have a database with tables and a cleaned hash database, so the only thing left is to convert it. dspam_2sql will output SQL statements; just feed these into MySQL.
For example, the conversion can be done like this
Be aware that it will probably take a lot of time.
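If you would rather drive the conversion from a script than from an interactive shell, the "one program prints SQL, the other reads it" idea can be sketched in Python as below. This is a generic pipe helper, not the exact command I used; `pipe` is a made-up name, and the command lines you pass in (including any arguments your patched dspam_2sql and your mysql client need) are yours to fill in.

```python
import subprocess


def pipe(producer_argv, consumer_argv):
    """Run `producer | consumer` and return (producer_rc, consumer_rc)."""
    producer = subprocess.Popen(producer_argv, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(consumer_argv, stdin=producer.stdout)
    # Close our copy of the pipe so the producer is notified (SIGPIPE)
    # if the consumer exits early.
    producer.stdout.close()
    consumer_rc = consumer.wait()
    producer_rc = producer.wait()
    return producer_rc, consumer_rc
```

A hypothetical call would look like `pipe(["dspam_2sql"], ["mysql", "dspam"])`, assuming the tool needs no further arguments and that the mysql client can find its credentials on its own. Check both exit codes: a non-zero code from either side means the import is incomplete.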
You can convert one single user with
That’s it. Modify your dspam.conf to use the MySQL driver and restart dspam.
The MySQL driver IS slower than the hash driver. The processing time for a message with the hash driver was around 0.1-0.2 seconds on this hardware. The MySQL driver takes between 0.2 and 2 seconds for a message.
The slowdown is not a problem for me. Things like real database consistency (some of my users had duplicated tokens with the hash driver), better maintenance tools, and easier manipulation of the data (plus the benefits of my MySQL replication+backup configuration) make it well worth it.