Archive for the 'spamassassin' Category

How to deal with joe-jobs and massive bounce storms

Wednesday, January 10th, 2007

As I’ve noted before, we
still have a major problem with sites generating bounce/backscatter storms in
response to forged mail (spam, viruses, and so on). These sites have a broken
mail configuration, but there are still thousands of them out there, since it’s
very hard to fix an old mail setup to avoid this issue. As a result, a single
spam run can be multiplied, Smurf-attack-style, into a concentrated flood of
response bounces, and this acts as a serious denial of service; I’ve
regularly had serious load problems and backlogs on my MX, due solely to
these bounces.

However, I think I’ve now solved it, with only a little loss of functionality.
Here’s how I did it, using Postfix and SpamAssassin.

Firstly, note that if you adopt this, you will lose functionality.
Third party sites will not be able to generate bounces which are sent
back to senders via your MX — except during the SMTP transaction.

However, if a message delivery attempt is run from your MX, and it is bounced
by the host during that SMTP transaction, this bounce message will still be
preserved. This is good, since this is basically the only bounce scenario that
can be recommended, or expected to work, in modern SMTP.

Also, a small subset of third-party bounce messages will still get past, and be
delivered: the ones that are not in the RFC-3464 bounce format generated
by modern MTAs, but that include your outbound relays in the quoted header.
The idea here is that “good bounces”, such as messages from mailing lists
warning that your mails were moderated, will still be safe.

OK, the details:

In Postfix

Ideally, we could do this entirely outside Postfix — but in my experience,
the volume (amplified by the Smurf attack effects) is such that these
need to be rejected as soon as possible, during the SMTP transaction.

On the machine that acts as MX for my domains, edit
‘/etc/postfix/header_checks’ and add these lines:

/^Return-Path: <>/                              REJECT no third-party DSNs
/^From:.*MAILER-DAEMON/                         REJECT no third-party DSNs
/^Content-Type: multipart\/report; /            REJECT no third-party DSNs
/^Content-Type: message\/delivery-status; /     REJECT no third-party DSNs

Edit ‘/etc/postfix/main.cf’, and ensure it contains this line:

header_checks = regexp:/etc/postfix/header_checks

Now restart Postfix.
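
To sanity-check the header patterns offline, something like the following Python sketch can replay them against sample header lines. It is only an approximation of what Postfix does: it assumes regexp tables match case-insensitively by default and see each header as one logical line. The sample headers and the `first_reject` helper are made up for illustration.

```python
import re

# The four patterns from header_checks, rewritten as Python regexes.
# Assumption: Postfix regexp tables are case-insensitive by default.
HEADER_CHECKS = [
    (r"^Return-Path: <>", "REJECT no third-party DSNs"),
    (r"^From:.*MAILER-DAEMON", "REJECT no third-party DSNs"),
    (r"^Content-Type: multipart/report; ", "REJECT no third-party DSNs"),
    (r"^Content-Type: message/delivery-status; ", "REJECT no third-party DSNs"),
]
COMPILED = [(re.compile(p, re.IGNORECASE), action) for p, action in HEADER_CHECKS]


def first_reject(header_line):
    """Return the action of the first matching pattern, or None."""
    for pattern, action in COMPILED:
        if pattern.search(header_line):
            return action
    return None


# Hypothetical sample headers, not taken from real traffic.
SAMPLES = [
    "From: Mail Delivery Subsystem <MAILER-DAEMON@example.com>",
    "Content-Type: multipart/report; report-type=delivery-status; boundary=x",
    "From: alice@example.com",
    "Content-Type: text/plain; charset=us-ascii",
]

for header in SAMPLES:
    print(first_reject(header) or "OK", "<-", header)
```

On a live system, Postfix's own `postmap -q` lookup against the `regexp:` table is the authoritative check; the sketch is just a quick way to reason about which headers the patterns catch.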

In SpamAssassin

Install the Virus-bounce ruleset. This will catch challenge-response mails,
“out of office” noise, “virus scanner detected blah” crap, and bounce mails
generated by really broken groupware MTAs: the stuff that gets past the Postfix
front-line.

Once you’ve done these two things, that deals with almost all the forged-bounce
load, at what I think is a reasonable cost. Comments welcome…

This post was written by Justin, source: How to deal with joe-jobs and massive bounce storms

Our first physical award

Thursday, December 7th, 2006

W00t!

This post was written by Justin, source: Our first physical award

SpamAssassin as an EC2 service

Thursday, November 30th, 2006

I had a bit of an epiphany while chatting to Antoin
about the qpsmtpd/EC2 idea. Craig had the same thoughts.

Here’s the thing — there’s actually no need to offload the SMTP part at all.
That stuff is tricky, since you’ve got to build in a lot of fault tolerance,
quality-of-service, uptime, etc. to ensure that the MX really is reachable.
Since an EC2 instance will lose its “disks” once rebooted/shut down, you need
to store your queues in Amazon S3 — which has differing filesystem semantics
from good old POSIX — so things get quite a bit hairier. On top of that, it
requires a little RFC-breakage; there are issues with using CNAMEs in MX
records, reportedly.

However, if we offload just the spamd part, it becomes a whole lot simpler. The
SPAMD protocol will work fine across long distances, securely, with SSL
encryption active, and SpamAssassin will work fine as a filtering system in an
entirely stateless mode, with no persistent-across-reboots storage. (What about
the persistent-storage aspects of spamd operation? There’s just the
auto-whitelist, which can be easily ignored, and I haven’t trained a Bayes
database in 2 years, so I doubt I’ll need that either ;)

If the spamd server is down or uncontactable, spamc will handle this and retry
with another server, or eventually give up and pass the message through, safely
intact (though unscanned).
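
The fail-open behaviour described above can be sketched roughly as follows. This is not spamc's actual code, just an illustration of the retry-then-pass-through idea; `scan_with_failover`, the `scan_one` callback, and the host names are all hypothetical.

```python
def scan_with_failover(message, servers, scan_one):
    """Try each spamd server in turn; fail open if none answers.

    scan_one(server, message) is a caller-supplied function (hypothetical)
    that returns a verdict, or raises OSError on connection trouble.
    Returns (verdict, server_used); verdict is None if the mail went unscanned.
    """
    for server in servers:
        try:
            return scan_one(server, message), server
        except OSError:
            continue  # server down or unreachable: try the next one
    return None, None  # give up: deliver the message intact but unscanned


# Demo with a fake scanner: the first host is "down", the second answers.
def fake_scan(server, message):
    if server == "spamd1.example.com":
        raise ConnectionRefusedError("simulated outage")
    return "ham"


print(scan_with_failover(b"Subject: hi\r\n\r\nhello",
                         ["spamd1.example.com", "spamd2.example.com"],
                         fake_scan))
```

The important design point is the last line of the function: when every server fails, the message is still delivered, so an EC2 outage costs filtering quality rather than mail.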

Given that there’s a cool third-party ClamAV plugin now available for
SpamAssassin, this system can offload the virus-scanning work, too.

So here’s the new plan: run the MTA, MX, and the super-lean “spamc” client on
the normal MX machine — and offload the “spamd” work to one or more EC2
machines.

Basically, there would be a CNAME record in DNS, listing the dynamic
DNS names of the EC2 spamd instances. Then, spamc is set to point at that
CNAME as the spamd host to use. As EC2 instances are started/removed,
they are added/removed from that CNAME list and spamc will automatically
keep up.

Pricing is reasonably affordable, as long as you don’t send over-large messages
to the EC2 spamd, rate-limit total incoming SMTP traffic in the MTA, and use the
SPAMD protocol’s REPORT verb to reduce the bandwidth consumption of mails in
transit, by ensuring that the mail messages are only transmitted one-way,
MX-to-EC2, instead of both MX-to-EC2 and EC2-to-MX.
That will keep the bandwidth pricing down.
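
To make the one-way idea concrete, here is a rough sketch of building a REPORT request and reading back just the verdict line. The wire format shown (a `REPORT SPAMC/1.2` request line, a `Content-length` header, and a `Spam: True ; score / threshold` response header) is my reading of the protocol notes shipped with SpamAssassin, so treat it as an assumption and check it against your version; the helper names and the sample response are invented.

```python
import re


def build_report_request(message: bytes) -> bytes:
    """Frame a message as a SPAMD REPORT request (assumed wire format)."""
    head = "REPORT SPAMC/1.2\r\nContent-length: %d\r\n\r\n" % len(message)
    return head.encode("ascii") + message


# Matches a response header such as "Spam: True ; 15.2 / 5.0".
SPAM_HEADER = re.compile(
    r"^Spam:\s*(True|False)\s*;\s*(-?[\d.]+)\s*/\s*(-?[\d.]+)",
    re.IGNORECASE | re.MULTILINE,
)


def parse_verdict(response: bytes):
    """Return (is_spam, score, threshold), or None if no Spam: header."""
    head = response.split(b"\r\n\r\n", 1)[0].decode("ascii", "replace")
    match = SPAM_HEADER.search(head)
    if match is None:
        return None
    return (match.group(1).lower() == "true",
            float(match.group(2)), float(match.group(3)))


# Invented sample response: only headers plus a short report come back,
# not the whole message, which is where the bandwidth saving comes from.
sample = (b"SPAMD/1.1 0 EX_OK\r\nContent-length: 18\r\n"
          b"Spam: True ; 15.2 / 5.0\r\n\r\n(report text here)")
print(build_report_request(b"hello"))
print(parse_verdict(sample))
```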

Recent figures indicate that I got about 90MB of mail per day, at peak, over
the past weekend (which nearly DOS’d my server and caused some firefighting) –
68MB of spam, and 13MB of blowback. At 20 cents per GB, that’s 1.8 cents per
day for traffic. Plus the $0.10 per instance hour, that’s $2.42 per day to run
a single EC2 instance to handle DDOS spikes. Of course, that can be shut down
when load is low.
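
For anyone who wants to plug in their own numbers, the arithmetic above works out like this. The prices and volume are the ones quoted in this post; treating 1 GB as 1000 MB is my simplification.

```python
MAIL_MB_PER_DAY = 90          # peak daily mail volume quoted above
TRANSFER_USD_PER_GB = 0.20    # bandwidth price quoted above
INSTANCE_USD_PER_HOUR = 0.10  # EC2 instance price quoted above

# Assumption: 1 GB = 1000 MB, which matches the 1.8 cent figure.
traffic_usd = MAIL_MB_PER_DAY / 1000 * TRANSFER_USD_PER_GB
instance_usd = 24 * INSTANCE_USD_PER_HOUR
total_usd = traffic_usd + instance_usd

print(f"traffic:  ${traffic_usd:.3f} per day")
print(f"instance: ${instance_usd:.2f} per day")
print(f"total:    ${total_usd:.2f} per day")
```

The instance-hours dominate; bandwidth is a rounding error at this volume, which is why shutting the instance down during quiet periods matters more than trimming traffic.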

Yep, this is looking very promising. Now when are Amazon going to let me
onto the beta program for EC2?…

This post was written by Justin, source: SpamAssassin as an EC2 service

Bleadperl regexp optimization vs SA

Thursday, November 16th, 2006

I’ve been looking some more into recent new features added to bleadperl by demerphq, such as Aho-Corasick trie
matching, and how we can effectively support this in SpamAssassin. Here’s the
state of play.

These are the “base strings” extracted from the SpamAssassin SVN trunk body
ruleset (ignore the odd mangled UTF-8 char in here, it’s suffering from
cut-and-paste breakage). A “base string” is a simplified subset of a rule’s
regular expression; specifically, base strings are extracted in the cases
where they are simpler than the full perl regular expression language, and
therefore amenable to fast parallel string matching algorithms.

The base strings appear in that file as “r” lines, like so:

r I am currently out of the office:__BOUNCE_OOO_3 __DOS_COMING_TO_YOUR_PLACE
r I drive a:__DOS_I_DRIVE_A
r I might be c:__DOS_COMING_TO_YOUR_PLACE
r I might c:__DOS_COMING_TO_YOUR_PLACE

The base string is the part after “r” and before the “:”; after that, the rule
names appear.
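
As a rough illustration of that format, the “r” lines can be parsed into lookup tables in both directions. This is a Python sketch rather than the plugin’s actual code; splitting on the last “:” assumes rule names never contain a colon, and `parse_base_strings` is a made-up helper name.

```python
from collections import defaultdict

SAMPLE = """\
r I am currently out of the office:__BOUNCE_OOO_3 __DOS_COMING_TO_YOUR_PLACE
r I drive a:__DOS_I_DRIVE_A
r I might be c:__DOS_COMING_TO_YOUR_PLACE
r I might c:__DOS_COMING_TO_YOUR_PLACE
"""


def parse_base_strings(text):
    """Return (string -> rule names, rule name -> strings) from 'r' lines."""
    string_to_rules = {}
    rule_to_strings = defaultdict(list)
    for line in text.splitlines():
        if not line.startswith("r "):
            continue  # skip any other line types in the file
        # Assumption: rule names contain no ':', so split on the last one.
        base, sep, names = line[2:].rpartition(":")
        if not sep:
            continue
        rules = names.split()
        string_to_rules[base] = rules
        for rule in rules:
            rule_to_strings[rule].append(base)
    return string_to_rules, dict(rule_to_strings)


string_to_rules, rule_to_strings = parse_base_strings(SAMPLE)
print(string_to_rules["I drive a"])
print(rule_to_strings["__DOS_COMING_TO_YOUR_PLACE"])
```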

Now, here are some limitations that make this less easy:

  • One string to many rules: each one of those strings corresponds to one or
    more SpamAssassin rules.

  • One rule to many strings: each rule may correspond to one or more of those
    strings. So it’s not a one-to-one correspondence either way.

  • No anchors: the strings may match anywhere inside the line, similar to
    ("foo bar baz" =~ /bar/).

  • Multiple rules can fire on the same line: each line can cause multiple
    rules to fire on different parts of its text.

  • Subsumption is not permitted: the base-string extractor plugin has already
    established cases where subsumption takes place. Each string will not
    subsume another string; so a match of the string “food” against the strings
    “food” and “foo” should just fire on “food”, not on “foo”.

  • Overlapping is permitted: on the other hand, overlapping is fine; “foobar”
    matched against “foo” and “oobar” should fire on both base strings. (The
    above two are basically for re2c compatibility. This is the main reason the
    strings are so simple, with no RE metachars — so that this is possible,
    since re2c is limited in this way.)

  • Most rules are more complex: most rules in the ruleset (as you can see
    from the ‘orig’ lines in that file) are more complex than the base string
    alone. So a base string match often needs to be followed by a
    “verification” match using the full regexp.

Now, the problem is to iterate through each line of the (base64-decoded,
encoding-decoded, HTML-decoded, whitespace-simplified) “body text” of a mail
message, with each paragraph appearing as a single “line”, and run all those
base strings in parallel, identifying the rule names that then need to
be run.

This is turning out to be quite tricky with the bleadperl trie code.

For example, if we have 3 base strings, as follows:

  hello:RULE_HELLO
  hi:RULE_HI
  foo:RULE_FOO

At first, it appears that we could use the matched base string itself as a key
into a lookup table to determine the rule that fired:

  # Map each base string to the rule(s) it should trigger.
  my %base_to_rulename_lookup = (
    'hello' => ['RULE_HELLO'],
    'hi' => ['RULE_HI'],
    'foo' => ['RULE_FOO']
  );

  # $1 holds whichever alternative matched first on the line.
  my $rule_fired;
  if ($line =~ m{(hello|hi|foo)}) {
    $rule_fired = $base_to_rulename_lookup{$1};
  }

However, that will fail in the face of the string “hi foo!”, since
only one of the bases will be returned as $1, whereas we want
to know about both “RULE_HI” and “RULE_FOO”.

m//gc might help:

  my %base_to_rulename_lookup = (
    'hello' => ['RULE_HELLO'],
    'hi' => ['RULE_HI'],
    'foo' => ['RULE_FOO']
  );

  # Collect the rules for every (non-overlapping) match on the line.
  my @rules_fired;
  while ($line =~ m{(hello|hi|foo)}gc) {
    push @rules_fired, @{ $base_to_rulename_lookup{$1} };
  }

That works pretty well, but not if two patterns overlap: /abc/
and /bcd/, matching on the string “abcd”, for example, will fire
only on “abc”, and miss the “bcd” hit.
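
The overlap problem is easy to reproduce, and one common regex idiom for working around it is to wrap the alternation in a zero-width lookahead so that no input is consumed. To be clear, this is not something I’ve tried against the bleadperl trie code; the sketch below is in Python, whose regex engine is not trie-based, so it only illustrates the match semantics and says nothing about speed. It also relies on the no-subsumption property noted earlier, since only one alternative is reported per start position.

```python
import re

BASE_TO_RULES = {"abc": ["RULE_ABC"], "bcd": ["RULE_BCD"]}
ALTERNATION = "|".join(re.escape(base) for base in BASE_TO_RULES)

# Consuming scan: each match eats its text, so overlaps are lost.
CONSUMING = re.compile(ALTERNATION)
# Lookahead scan: matches are zero-width, so every start position is tried.
LOOKAHEAD = re.compile("(?=(" + ALTERNATION + "))")


def rules_consuming(line):
    hits = []
    for match in CONSUMING.finditer(line):
        hits.extend(BASE_TO_RULES[match.group(0)])
    return hits


def rules_lookahead(line):
    hits = []
    for match in LOOKAHEAD.finditer(line):
        hits.extend(BASE_TO_RULES[match.group(1)])
    return hits


print(rules_consuming("abcd"))  # misses the "bcd" hit
print(rules_lookahead("abcd"))  # reports both bases
```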

Given this, it appears the only option is to run the trie match, and then
iterate on all the regexps for the base strings it contains:

  if ($line =~ m{hello|hi|foo}) {
    $line =~ /hello/ and rule_fired("HELLO");
    $line =~ /hi/ and rule_fired("HI");
    $line =~ /foo/ and rule_fired("FOO");
  }

Obviously, that doesn’t provide much of a speedup — in fact,
so far, I’ve been unable to get any at all out of this method. :(

This can be optimized a little by breaking into multiple trie/match
sets:

  if ($line =~ m{hello|hi}) {
    $line =~ /hello/ and rule_fired("HELLO");
    $line =~ /hi/ and rule_fired("HI");
    ...
  }
  if ($line =~ m{foo|bar}) {
    $line =~ /foo/ and rule_fired("FOO");
    $line =~ /bar/ and rule_fired("BAR");
    ...
  }

But still, the reduction in regexp OPs vs the addition of logic
OPs to do this, result in an overall slowdown, even given
the faster trie-based REs.

Suggestions, anyone?

(by the way, if you’re curious, the current code is
here in SVN.)

This post was written by Justin, source: Bleadperl regexp optimization vs SA

PhishTank now supported by SpamAssassin

Thursday, October 19th, 2006

Thanks to Jeff Chan of the SURBL project, data from PhishTank is now being included in the SURBL ‘ph’ anti-phishing list.

This means it’s now supported by all existing versions of SpamAssassin from 3.0.0 onwards. Good news, and thanks to Jeff and the OpenDNS guys!

This post was written by Justin, source: PhishTank now supported by SpamAssassin

Some p0f Data From Craig

Tuesday, October 3rd, 2006

Regarding the use of p0f, passive OS fingerprinting, as an anti-spam measure — on top of this analysis which I linked to a few weeks back, one of the emeritus SA guys, Craig Hughes, sends over some p0f experiences. Handily, this includes a more detailed breakdown by OS release:

I’ve been using the SA p0f plugin for nearly a month or so now both on
gumstix’s web server and my hughes-family.org
server, and it actually looks like it could be pretty useful. So far I’ve
just been scoring 0.001 for each OS to collect data, but here’s the results
amavis has logged:

This breakdown shows what %age of the stuff coming in via OS xyz is spam or
ham. ie 84.6% of all mail received from Windows-2000 is spam, 14.9% is ham
(the rest is viruses). The first numeric column is number of messages of
each type. Statistics are only since the last time amavis restarted:

On his home machine (comcast cable modem connection) :

spam.byOS.Windows-2000 438 1/h 84.6 %
spam.byOS.Linux 417 1/h 18.3 %
spam.byOS.Windows-XP 265 1/h 97.8 %
spam.byOS.UNKNOWN 135 0/h 55.1 %
spam.byOS.Windows-XP/2000 24 0/h 100.0 %
spam.byOS.Novell 5 0/h 100.0 %
spam.byOS.Windows-98 3 0/h 60.0 %
spam.byOS.Windows-2003 2 0/h 66.7 %
spam.byOS.FreeBSD 2 0/h 1.3 %
spam.byOS.Solaris 1 0/h 1.8 %
spam.byOS.Windows-SP3 1 0/h 100.0 %
ham.byOS.Linux 1851 6/h 81.2 %
ham.byOS.FreeBSD 143 0/h 96.0 %
ham.byOS.UNKNOWN 102 0/h 41.6 %
ham.byOS.Windows-2000 77 0/h 14.9 %
ham.byOS.Solaris 56 0/h 98.2 %
ham.byOS.NetCache 6 0/h 100.0 %
ham.byOS.Windows-XP 6 0/h 2.2 %
ham.byOS.Tru64 2 0/h 100.0 %
ham.byOS.AIX 2 0/h 100.0 %
ham.byOS.Windows-98 2 0/h 40.0 %
ham.byOS.Windows-2003 1 0/h 33.3 %

On gumstix.com (hosted at some provider in Texas):

spam.byOS.Windows-2000 401 1/h 58.4 %
spam.byOS.Windows-XP 131 0/h 92.9 %
spam.byOS.UNKNOWN 64 0/h 18.7 %
spam.byOS.Windows-XP/2000 29 0/h 96.7 %
spam.byOS.FreeBSD 11 0/h 4.1 %
spam.byOS.Linux 11 0/h 0.5 %
spam.byOS.Windows-98 6 0/h 85.7 %
spam.byOS.Solaris 4 0/h 3.3 %
spam.byOS.Windows-SP3 2 0/h 100.0 %
ham.byOS.Linux 1983 4/h 97.6 %
ham.byOS.UNKNOWN 277 0/h 80.8 %
ham.byOS.Windows-2000 271 0/h 39.4 %
ham.byOS.FreeBSD 253 0/h 93.7 %
ham.byOS.Solaris 116 0/h 96.7 %
ham.byOS.NetCache 40 0/h 100.0 %
ham.byOS.Windows-XP 9 0/h 6.4 %
ham.byOS.Windows-NT 7 0/h 70.0 %
ham.byOS.Novell 3 0/h 100.0 %
ham.byOS.Windows-XP/2000 1 0/h 3.3 %
ham.byOS.Windows-98 1 0/h 14.3 %
ham.byOS.Windows-2003 1 0/h 100.0 %

my home machine has a lot more relayed mail coming to it (all my
various craig@* email addresses forward into there) which is probably
why the linux spam rate is higher there — the relaying machines are
probably running linux and forwarding spam through.

Interesting figures, but I’m still not convinced that the correlation
is quite high enough to form a good enough basis for solid anti-spam rules;
reliable rules in the SpamAssassin core typically have over 95% accuracy at
differentiating ham from spam (at least when we first check them in).
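
To put numbers on that, here is a quick sketch comparing a few rows from the home-machine table against the 95% bar. It uses only the spam and ham counts, so the fractions differ slightly from the percentages in the table, which also include viruses. The `spam_fraction` helper and the choice of rows are mine, and treating the 95% figure as a simple purity threshold is a simplification.

```python
# (spam, ham) message counts per OS, copied from the home-machine table.
HOME_COUNTS = {
    "Windows-2000": (438, 77),
    "Windows-XP": (265, 6),
    "Linux": (417, 1851),
    "FreeBSD": (2, 143),
}


def spam_fraction(spam, ham):
    """Fraction of (spam + ham) mail that is spam; viruses are ignored."""
    return spam / (spam + ham)


for os_name, (spam, ham) in HOME_COUNTS.items():
    frac = spam_fraction(spam, ham)
    # A signal is "pure" if it is at least 95% spam or at least 95% ham.
    purity = max(frac, 1 - frac)
    label = "clears 95%" if purity >= 0.95 else "below 95%"
    print(f"{os_name:14s} spam fraction {frac:.3f}  ({label})")
```

On these counts, Windows-XP and FreeBSD clear the bar but carry relatively little mail, while the two biggest buckets, Windows-2000 and Linux, fall short.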

Update: it’s a natural for use as a Bayes token, though. The way amavisd-new implements p0f support is perfect for this use.

BTW, my guess is that many of the spam hits for “linux” are due to things like Netgear/Linksys routers, running embedded linuces. No evidence, just guessing ;)

This post was written by Justin, source: Some p0f Data From Craig

Linus on Bayesian filtering

Monday, October 2nd, 2006

Linus Torvalds, in a post to linux-kernel today:

I’m sorry, but spam-filtering is simply harder than the bayesian word-count
weenies think it is. I even used to know something about bayesian
filtering, since it was one of the projects I worked on at uni, and dammit,
it’s not a good approach, as shown by the fact that it’s trivial to get
around.

I don’t know why people got so excited about the whole bayesian thing. It’s
fine as one small clause in a bigger framework of deciding spam, but it’s
totally inappropriate for a “yes/no” kind of decision on its own.

If you want a yes/no kind of thing, do it on real hard issues, like not
accepting email from machines that aren’t registered MX gateways. Sure, that
will mean that people who just set up their local sendmail thing and connect
directly to port 25 will just not be able to email, but let’s face it, that’s
why we have ISP’s and DNS in the first place.

But don’t do it purely on some bogus word analysis.

If you want to do word analysis, use it like SpamAssassin does it - with some
Bayesian rule perhaps adding a few points to the score. That’s entirely
appropriate. But running bogo-filter instead of spamassassin is just
asinine.

Me, I like bogofilter — those guys are cool, and it’s a great anti-spam product for many purposes. But of course I have to agree with Linus that the correct approach in most cases is a bigger picture than just Bayes alone, a la SpamAssassin ;)

This post was written by Justin, source: Linus on Bayesian filtering

More parallel string-match algorithm hacking: re2xs

Thursday, August 17th, 2006

Last week, Matt Sergeant released a great little perl script, re2xs, which
takes a set of simplified regexps, converts them to the subset of regular
expression language supported by re2c, then uses that to build an XS module.

In other words, it offers the chance for SpamAssassin rules to be compiled into
a trie structure in C code to match multiple patterns in parallel. Given that
this is then compiled down to native machine code, it has the potential to be
the fastest method possible, apart from using dedicated hardware co-processors.

Sure enough, Matt’s results were pretty good — he says, ‘I managed to match
10k regexps against 10k strings in 0.3s with it, which I think is fairly good.’ ;)

Unfortunately, turning this into something that works with SpamAssassin hasn’t
been quite so easy. SpamAssassin rules are free to use the full perl regular
expression language — and this language supports many features that re2c’s
subset does not. So we need to extract/translate the rule regexps to
simplified subsets. This has generally been the case with all parallel
matching systems, anyway, so that’s not a massive problem.

More problematically, re2c itself does not support nested patterns — if one
token is contained within another, e.g. “FOO” within “FOOD”, then the subsumed
token will not be listed as a match. SpamAssassin rules, of course, are free
to overlap or subsume each other, so an automated way to detect this is
required.

For simple text patterns, this is easy enough to do using substring matching –
e.g. “FOOD” =~ /\QFOO\E/ . Unfortunately, once any kind of sophisticated
regexp functionality is available, this is no longer the case: consider
/FOO*OD/ vs /FOO/ , /F[A-Z]OD/ vs /FO[M-P]/ , /F(?:OO|U)D/ vs /F(?:O|UU)?O/ .
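
For the easy, literal-only case, the check is just a pairwise substring test. A minimal sketch in Python (the `find_subsumed` name is made up, and the quadratic loop is only sensible for modest pattern counts):

```python
def find_subsumed(literals):
    """Return (inner, outer) pairs where inner occurs inside outer."""
    pairs = []
    for inner in literals:
        for outer in literals:
            if inner != outer and inner in outer:
                pairs.append((inner, outer))
    return pairs


print(find_subsumed(["FOO", "FOOD", "BAR"]))
```

Nothing this simple works once the patterns contain real regexp syntax, which is the whole problem.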

The only way to do this is to either (a) fully parse the regexp, build the
trie, and basically reimplement most of re2c to do this in advance; or (b)
change the trie-generation code in re2c to support states returning multiple
patterns, as Aho-Corasick does.

I requested support for this in re2c, but got a brush-off, unfortunately. So
work continues…

In other news, that food poisoning thing I had back at the end of June has lingered on. It’s now pretty
clear that it isn’t food poisoning or a stomach
bug… but I still have no idea what it actually is. No fun :(

This post was written by Justin, source: More parallel string-match algorithm hacking: re2xs

SpamAssassin advisory CVE-2006-2447

Wednesday, June 7th, 2006

CVE-2006-2447, in which Radoslaw Zielinski spotted a nasty in spamd’s
‘vpopmail’ support in pretty much all recent versions of Apache SpamAssassin.

If you use spamd with vpopmail, go read the advisory and determine if you need
to take action. Not many people will need to, I think; it’s a very rare setup.
Still, it’s important to get the warning out there anyway.

The irony is that the bug is triggered partly by the “--paranoid” switch. This
was intended to increase security, by increasing paranoia when
possibly-unsafe situations arose; hence it provides a great demonstration of
how the addition of optional code paths, even with the best intentions, can
reduce security by allowing bugs to creep in unnoticed.

This post was written by Justin, source: SpamAssassin advisory CVE-2006-2447

SpamAssassin in the Google Summer of Code 2006

Sunday, April 30th, 2006

Are you a student, and interested in earning $4,500 for contributing to open
source, and fighting spam, over the course of the summer?

If so, get thee hence to the Google Summer of Code 2006 site, and propose a
project!

Last year, we in SpamAssassin didn’t get it
together to mentor SoC projects. This year, however, we have a few prospective
mentors (including myself), and a few sample project ideas lined up; we’re all
ready to go! Here’s the Student FAQ. Be quick; applications end
in a week and a bit.

Here’s hoping we get some interesting submissions ;)

This post was written by Justin, source: SpamAssassin in the Google Summer of Code 2006

Phishing and Inept Banks

Friday, April 21st, 2006

John Graham-Cumming asks, ‘Are Citibank crazy?’:

I blogged a while ago about Thunderbird’s phishing filter trapping a
seemingly innocent mail. Now, a reader has forwarded to me a genuine email
from Citibank that he says was trapped by Thunderbird. I’m not going to
reproduce the email here because it contains private details of the user, but
it is a valid Citibank message.

Thunderbird thinks it’s a scam because Citibank uses one of the oldest
phishing tricks in the book. They have a URL displayed in the message that,
when clicked, goes to a totally different URL.

Sadly, this has proven to be really quite common. We’ve investigated using
this as a phish-detection rule in SpamAssassin, several times,
and without much luck. In fact, we’ve had to create a FAQ entry for it: since
it’s such a superficially attractive, but ultimately useless, idea, many people
have had long discussions on our lists about it!

The companies that produce these false positives in their mails include
American Express, Bed Bath & Beyond, Universal Studios, Microsoft, Hilton
Hotels — and now Citibank.

A couple of other examples from real mails:

  <a href="http://www65.americanexpress.com/clicktrk/Tracking?
    mid=MESSAGEID&msrc=ENG-ALERTS&url=
    https://www.americanexpress.com/estatement/?12345">
    https://www.americanexpress.com/estatement/?12345</a>

  <A HREF="http://echo.epsilon.com/WebServices/EchoEngine/T.aspx?l=ID">
    https://www.hilton.com/en/ww/email/tab_email_subscriptions.jhtml</A>
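
To see why the rule is so tempting, and why it misfires, here is a naive version of it as a Python sketch. It is not SpamAssassin’s code, and the `LinkMismatchFinder` class is invented for illustration; it just flags any link whose visible text is a URL on a different host from the real target, which is exactly what the legitimate Hilton mail above does.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class LinkMismatchFinder(HTMLParser):
    """Collect (shown_host, real_host) pairs for links where they differ."""

    def __init__(self):
        super().__init__()
        self.mismatches = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            shown = "".join(self._text).strip()
            # Link text that is not a URL ("click here") has no hostname.
            shown_host = urlsplit(shown).hostname
            real_host = urlsplit(self._href.strip()).hostname
            if shown_host and real_host and shown_host != real_host:
                self.mismatches.append((shown_host, real_host))
            self._href = None


def find_mismatches(html):
    finder = LinkMismatchFinder()
    finder.feed(html)
    finder.close()
    return finder.mismatches


HILTON = ('<A HREF="http://echo.epsilon.com/WebServices/EchoEngine/T.aspx?l=ID">'
          'https://www.hilton.com/en/ww/email/tab_email_subscriptions.jhtml</A>')
print(find_mismatches(HILTON))
```

The genuine mail trips the check because of its click-tracking redirect, so the rule cannot tell it apart from a phish.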

By the way, it really is quite impressive for a bank as heavily phished as
Citibank to still be making this kind of basic mistake in their mail-outs!
It reinforces a point I made in a mailing list posting recently:

As far as I can see, the approach taken by pretty much all banks to their
online services is simply too bureaucratic, hide-bound, and fundamentally
driven by their marketing departments, to ever cope effectively with
phishing. :(

(For what it’s worth, I know Citi have some smart people working there; but the rest of the company needs to start paying attention to them.)

This post was written by Justin, source: Phishing and Inept Banks

Disclosure

Friday, March 10th, 2006

As of yesterday, I have a new day-job.

I won’t be working on email spam as part of the job, which is an interesting turn of events. However, I’ll be sticking with the open-source Apache SpamAssassin project, and keeping up the rate of work on that [*].

I’m not sure how much I can blog about the new place just yet, but I will say it’s certainly looking like it’ll be very interesting work ;)

[*: modulo the next couple of weeks while I’m waiting for my bloody DSL to be installed. argh!]

This post was written by Justin, source: Disclosure

We Win

Wednesday, March 8th, 2006

ongoing: The ASF Server:

Tim Bray: Which Apache project burns the most resources?

Mads: Spamassassin by a wide margin. […]

Heh, we win ;)

Helios, the Zones server, has been an incredible resource for us. SpamAssassin
isn’t a traditional open-source software project in one respect: we use a lot
of centralized “phone home” infrastructure to support rule and score
generation. Having a virtualized server of this quality and horsepower to use
for this has been fantastic.

(thanks to John O’Shea for the pointer!)

This post was written by Justin, source: We Win

My ApacheCon Roundup

Thursday, December 15th, 2005

Back from ApacheCon!

I’ve got to say, I found it really useful this year. Last year, I
was pretty new to the ASF, and found that my expectations of
ApacheCon didn’t quite match reality; it wasn’t a rip-roaring success
exactly, for me, as a result.

However, many details of how the ASF works — and how the conference
itself works and is organised — are much clearer after you’ve spent
some time lurking and absorbing practices in the meantime. (The
visibility one gets into the process as a member of the ASF makes
this a lot easier.)

Result: it was much more of a success for me this time around.
Plenty of networking, putting faces to the names, hanging out, and
discussing many aspects of our work.

The hackathon really worked out, too; while we didn’t produce a hell
of a lot of code per se, it made for a good ‘developer summit’ and I
think we established solid agreement on SpamAssassin’s short-term
directions and goals. (summary: rules, and faster).

On top of that, I got to meet up with Colm MacCarthaigh and Cory Doctorow
for discussion of Digital Rights Ireland. Looks like I’ll be
spending a bit of time on that next year ;)

Finally: Solaris. On Monday night, I got to sit down with Daniel Price, one of
the kernel engineers behind Solaris Zones, work through a quick demo of a bug
I was running into with chroot(2) and zones on our rule-QA buildbot server, and
watch as he visually traced it through the OpenSolaris kernel source on the
web. From this, and from talking to Daniel, it’s pretty clear
that things have changed at Sun. Pretty much the entire Solaris
operating system is now a full-on open-source project; it’s not just
a marketing gimmick. The source is up there on the web, that’s the
source for the code they’re running now, and there’s no half-assed
‘freeze it, cut out the good bits, and throw it over the wall’
fake-open-source tricks.

The concept of getting this level of access to Solaris source code
and engineers would have blown my mind when I was Iona’s sysadmin
back in the 1990s ;) I’m very impressed.

This post was written by Justin, source: My ApacheCon Roundup

ApacheCon US 2005

Monday, December 5th, 2005

In a couple of weeks, I’ll be going to San Diego for ApacheCon US 2005 (including the hackathon beforehand). There’ll be quite a few other SpamAssassin committers there, too, so if you’re working with SA, or interested in getting some face time with the developers, there’s no better way of doing so.

This post was written by Justin, source: ApacheCon US 2005

New SpamAssassin Rule Development Tools

Wednesday, November 23rd, 2005

Recently, I’ve been working on new systems to develop SpamAssassin rules
faster, and with a lower ‘barrier to entry’ to the core ruleset. Some
highlights seem bloggable, seeing as it’s all web-based and I can link
to it!

The ‘preflight’ BuildBot:

This uses the fantastic BuildBot continuous-integration system to monitor
changes to our Subversion repository.

Every time something is checked into SVN, this wakes up and immediately runs
mass-checks using that latest code and rules, allowing near-real-time viewing
of changes in rule behaviour. (A ‘mass-check’ is a massive run of
SpamAssassin across a corpus of hundreds of thousands of emails, en masse, to
measure rule hit-rates.)
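
As a toy illustration of what “measuring rule hit-rates” means, here is a Python sketch that turns per-message results into spam and ham hit percentages per rule. The input format, the rule names, and the `hit_frequencies` helper are all invented for the example; the real mass-check logs and the ‘freqs’ output carry more detail than this.

```python
from collections import Counter


def hit_frequencies(results):
    """results: iterable of (is_spam, set_of_rule_names), one per message.

    Returns {rule: (spam_hit_percent, ham_hit_percent)}.
    """
    totals = Counter()
    hits = {True: Counter(), False: Counter()}
    for is_spam, rules in results:
        totals[is_spam] += 1
        hits[is_spam].update(rules)
    table = {}
    for rule in set(hits[True]) | set(hits[False]):
        spam_pct = 100.0 * hits[True][rule] / totals[True] if totals[True] else 0.0
        ham_pct = 100.0 * hits[False][rule] / totals[False] if totals[False] else 0.0
        table[rule] = (spam_pct, ham_pct)
    return table


# Tiny made-up corpus: 4 spam and 4 ham messages.
CORPUS = [
    (True, {"RULE_A", "RULE_B"}), (True, {"RULE_A", "RULE_B"}),
    (True, {"RULE_A"}), (True, set()),
    (False, {"RULE_B"}), (False, set()), (False, set()), (False, set()),
]

for rule, (spam_pct, ham_pct) in sorted(hit_frequencies(CORPUS).items()):
    print(f"{rule}: {spam_pct:.1f}% of spam, {ham_pct:.1f}% of ham")
```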

The corpus it mass-checks is split in a certain way so that results will be
available very quickly — typically in under 10 minutes — with increasing
quantities of results becoming available as time elapses.

Progress of the mass-checks is visible at the BuildBot here; as they complete,
their results become visible on the Rule-QA app (below). (More info, if you’re
curious.)

The Rule-QA App:

To date, we’ve used the basic “freqs” table — output from the
hit-frequencies command-line script — as the UI for rule QA and evaluation.
This is fine for a small number of developers, but it scales badly and (like
mass-checks) requires a pretty complex setup on the developer’s machine.

This new component is a web application, which takes the “freqs” table, and
“webifies” it — demo.

Some major improvements are also made possible; the most important, that it
can now display ‘freqs’ for multiple revisions during the day, and keeps
historical data for comparison. It adds several new reports from
‘hit-frequencies’; a score-map, overlaps, a performance measurement, and a
boolean ‘promoteability’ measurement.

Finally, a really useful new report is the graph of rule hit-rate, as it
changes over time. Here’s a cached demo, or see the same data produced ‘live’.
This gives a totally new insight into how the rule hits for various people’s
corpora, how that changed over time, and allows a whole new type of rule
analysis. (In fact, it also allows pretty good corpus analysis, too; can you
tell which submitters bounce high-scoring spam at receipt time?)

(More info on these.)

This post was written by Justin, source: New SpamAssassin Rule Development Tools