Monday, May 26, 2008

POPFile v1.0.1 released plus a glimpse of the future

POPFile v1.0.1 was released today; this is the first ever POPFile release that I didn't do. POPFile is now being managed by a core team of developers: Manni Heumann (in Germany), Brian Smith (in the UK), me (in France), Joseph Connors (in the US) and Naoki Iimura (in Japan). A truly international effort. The actual release binaries were built by Brian Smith who, for a long time, has been the installer guru.

This release contains minor feature improvements and a number of bug fixes. Some of the bug fixes were for annoying bugs that showed up only occasionally, which makes it a worthwhile upgrade.

Since I pulled back from being involved in every detail of POPFile's evolution the core team has been liberated to work on the project. v1.0.1 is their first release, and it is minor, but much greater things are coming:

1. A native Mac installer

2. A SOHO version of POPFile. Some time ago I did most, but not all, of the work to make a multi-user version of POPFile. That work is being completed by the core team and will allow a single POPFile installation to be shared by multiple users.

Thank you to the POPFile Core Team for this great start to a new chapter in POPFile history.


Monday, May 19, 2008

A post (anti-spam-) retirement note

One of the anti-spam companies I was/am involved with, MailChannels, made an interesting announcement recently about a commercial offering for SpamAssassin. What makes the announcement interesting to me is that Justin Mason (who wrote SpamAssassin) is also an advisor to MailChannels.

The program, Traffic Control 3 for SpamAssassin, is a free download and for sites that process less than 10,000 messages per day there's no charge at all (and no need to go and get a license from MailChannels).

Basically, the new product acts as a front-end to SpamAssassin, traffic shaping incoming messages so that load is taken off SpamAssassin and the mail server.


Thursday, May 01, 2008

The Spammers' Compendium finds a new home

Shortly after I announced that I was getting out of anti-spam the folks at Virus Bulletin contacted me about taking over The Spammers' Compendium. I was delighted.

Today the transfer is complete and the new home is here. It will be maintained and updated by Virus Bulletin. Please send submissions to them.


Monday, March 31, 2008

Multi-route (email and phone) self-aware phishing

Today, I received the following email:

This communication was sent to safeguard your account against any
unauthorized activity.

Max Federal Credit Union is aware of new phishing e-mails
that are circulating. These e-mails request consumers to click
a link due to a compromise of a credit card account.

You should not respond to this message.

For your security we have deactivate your card.

How to activate your card

Call +1 (800)-xxx-9629

Our automated system allows you to quickly activate your card

Card activation will take approximately one minute to complete.


Of course, I don't have an account with Max Federal Credit Union and this is obviously a phish. Notice that the English isn't quite right:

"For your security we have deactivate your card." is ungrammatical, and "You should not respond to this message." doesn't make sense in context.

What's more interesting is that the message itself warns you about phishing emails and asks you to call an 800 number.

If you call the 800 number an electronic voice reminds you again to never give your PIN, password or SSN in email and then proceeds to ask you for the card number, PIN, expiry date and CVV2. The assumption is that you've been warned twice not to do something in email, so it's OK by phone.

It's painful to see the phisher use the existence of phishing as a way to phish.


Tuesday, March 25, 2008

"Retiring" from anti-spam

Today, I'm "retiring" from anti-spam work. Practically, that means the following:
  • No more updates to The Spammers' Compendium or Anti-spam Tool League Table pages. These remain on line, but are not being maintained.
  • I'm looking for a new leader for the POPFile project.
  • I'm no longer active on any anti-spam mailing lists.
  • I am leaving all anti-spam conference committees.
  • My anti-spam newsletter is no longer being published.
I will, however, be continuing with commercial anti-spam work where I have agreements currently in place with customers. No change to their support, terms or assistance.

The obvious question is why? For me, the interest just isn't there. The battle against spam continues but is now about trench warfare rather than creating new weapons. We'll continue to see innovation, but for any hacker it's the new, new thing that's important. For me, spam is yesterday's news. Watching companies squabble and refuse to cooperate, seeing a decline in quality at anti-spam conferences, and major companies essentially killing their consumer anti-spam means anti-spam just isn't where I want to be.

Of course, there are many really good people fighting spam out there. This post isn't meant to demean them.

Thank you to everyone who has supported what I've done over the last 7 years, and good luck!


Tuesday, March 11, 2008

First assume all new email is useless

When I download email none of it goes in my Inbox. In fact, I don't have an Inbox. I work on the assumption that all new email is useless.

Many reports tell us that between 80% and 90% of all email is spam, so for starters only 10% to 20% is at all likely to be useful. Then, if you account for being on mailing lists, being CC:ed needlessly and receiving automatic updates such as order confirmations from Amazon.com, you'll see that almost all email is useless. Only a tiny fraction of the mail you receive is useful. And by useful I mean requiring action.

I use Thunderbird and my email folder structure looks like this:



When email arrives it is automatically sorted using POPFile into the folders: Family, GNU Make, Misc, polymail, POPFile and Spam. These six folders are the categories of mail that I receive:

  • POPFile: Since I wrote POPFile I get lots of mail about it, and I use this as a general box for other open source projects I work on and anything else about anti-spam
  • polymail: Anything to do with my commercial product polymail and my consulting business
  • GNU Make: Anything to do with GNU Make or the company, Electric Cloud, that I co-founded
  • Family: Anything from my family
  • Misc: Order confirmations, airline tickets, PayPal statements, etc.
  • Spam: spam

POPFile uses Naive Bayesian text classification to automatically sort my email (with just a point and click interface for training) and then six rules (which never need updating) move the incoming mail based on POPFile's classification to one of those folders.
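To make the rule step concrete, here is a minimal Python sketch of what those six rules amount to. It assumes POPFile has written its verdict into an X-Text-Classification header; the bucket names, the BUCKET_TO_FOLDER table and the folder_for function are hypothetical, and in practice the mail client's own filters do this job.

```python
from email import message_from_string

# Hypothetical mapping from POPFile bucket names to mail folders.
BUCKET_TO_FOLDER = {
    "family": "Family",
    "gnumake": "GNU Make",
    "misc": "Misc",
    "polymail": "polymail",
    "popfile": "POPFile",
    "spam": "Spam",
}


def folder_for(raw_message, default="Misc"):
    """Pick a folder from the classification header (assumed to be present)."""
    msg = message_from_string(raw_message)
    bucket = (msg.get("X-Text-Classification") or "").strip().lower()
    # Unknown or missing classifications fall back to the default folder.
    return BUCKET_TO_FOLDER.get(bucket, default)


sample = "From: someone@example.com\nX-Text-Classification: spam\n\nBuy now!"
print(folder_for(sample))
```

One rule per bucket is enough because the classifier does the hard work; the rules only read a single header and never look at the message content.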

Of course, POPFile can be used to sort mail in any way you choose: my categories are unlikely to be yours. You might use POPFile to sort Work from Home from Spam. At least one journalist I know uses POPFile to sort Interesting from Boring from Spam so that he only gets to read interesting press releases.

When I identify mail that does need action taken I move it to the ACTION folder (which is the closest I've got to an Inbox). Moving mail there is a snap because I use the QuickMove extension for Thunderbird and have ALT-number keys mapped to each folder: one key press and the message is moved into or out of ACTION.

To keep on top of things I publish the number of items in my ACTION folder on my web site. Here's a live view over the last 24 hours. Currently, 9 items need dealing with.



My rules for managing email:

  1. Assume that all new email is useless
  2. Automatically sort email into folders on delivery
  3. Take control of your Inbox: only you put email in it


Monday, March 03, 2008

To the idiotic spammer posting comment spam on this site

Since your name is two Chinese characters I'm going to address you as "Dude".

Dude,

Lately you've been posting comment spam on my blog for your World of Warcraft Gold. This is a little silly:

1. I'm fairly well known in anti-spam circles, did you really think I was going to let comment spam through on this site?

2. Comment moderation is turned on on this site. So your comment spam goes nowhere when I click the Discard button.

3. There has been some collateral damage from your World of Warcraft spamming. I accidentally killed two comments by Hypermechanic and I can't retrieve them. He/she wanted to say something useful about an old post:


You could do that like cameroid.com .
I guess in JAVA or .NET.

and

Cool I will hunt for it… This is a very sweet look app you have here. Even though you down play your role this is still brilliant.

Thank you for something new and useful.


Wednesday, February 06, 2008

A clever, targeted email/web scam with a nasty sting

Steve Kirsch sent me an interesting message he'd received from seatac@bbb.org (i.e. an email address from the Better Business Bureau) containing an apparent complaint from a customer submitted through the BBB. The email itself was actually sent from a BellSouth ADSL line (i.e. almost certainly a zombie machine). The address was not authorized to send as bbb.org according to BBB's SPF records.

But the content of the email message is very interesting. Here's a screenshot:



Notice how the email contains the correct address for Steve, his name and the name of his company and thus appears to be a real complaint. The link below the complaint, where you can get full details, is the first of two nasty stings in this message.

The actual URL is:

http://search.bbb.org/ViewReport.aspx?sess=32b7aa40a693b0296624c132240947d0d
39365d555819ead6b5e59cc7f257bdeb1fc30198daafdfe88937338af1519acbeafcc25e5a3153db
0320ba5d0c0579a775c83632dd36d7971b95d9f85e64bdb&lnk=http%3a%2f%2faltaconsultants.com
%2fcomplaints%2fViewReport.php?case=840915898&biz=&bbb=1186

i.e. the link actually goes to the BBB's own web site (making it seem even more likely that this is a genuine message). The link manipulates the search option on the BBB web site using the lnk parameter to perform a redirect to http://altaconsultants.com/complaints/ViewReport.php?case=840915898 which in turn redirects to http://www.kfsolicitors.com/complaints/ViewReport.php?case=840915898. And it's on that, presumably hacked, site that the real scam starts.
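The redirect becomes obvious once the query string is pulled apart. The following is only an illustrative Python sketch: redirect_targets is a made-up name, the sess value is shortened for readability, and the check (any parameter whose value is an absolute URL on a different host) is a simple heuristic rather than a complete open-redirect detector.

```python
from urllib.parse import urlsplit, parse_qs


def redirect_targets(url):
    """Return query parameter values that are absolute URLs on another host."""
    parts = urlsplit(url)
    targets = []
    for values in parse_qs(parts.query).values():
        for value in values:
            target = urlsplit(value)
            if (target.scheme in ("http", "https")
                    and target.hostname
                    and target.hostname != parts.hostname):
                targets.append(value)
    return targets


# The URL from the message, with the long sess value shortened.
example = (
    "http://search.bbb.org/ViewReport.aspx?sess=32b7aa40"
    "&lnk=http%3a%2f%2faltaconsultants.com%2fcomplaints%2fViewReport.php"
    "?case=840915898&biz=&bbb=1186"
)
print(redirect_targets(example))
```

The visible host is search.bbb.org, but the decoded lnk parameter is where the browser actually ends up.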

If you are not using Microsoft Internet Explorer you'll be presented with the following web page:



Once you've upgraded you get told that the web site requires the "Adobe Acrobat ActiveX" control and you need to install it.



The control itself is embedded using the following code:

<object classid="clsid:D68E2896-9FD9-4b70-A9AE-CCDF0C321C45" height="0" width="0" codebase="Acrobat.cab"></object>

Notice how instead of pointing to Adobe's web site to get the control it's available locally as Acrobat.cab. So when you follow the instructions you download and install an ActiveX control from the scammer web site.

Once you've done that you get told that in fact the customer has withdrawn their complaint and there's nothing to worry about:



Now for the second sting. There must be something about this ActiveX control that's malicious... the scammer didn't go to all that trouble for nothing. But none of the current anti-virus programs report any problems with the file.

For example, my Sophos anti-virus says nothing, and online scanners such as Kaspersky's say that it's clean:



So, perhaps the file really is clean, but I suspect that this is a new threat which isn't currently detected by anti-virus. I'll post again when I get a response from Sophos' anti-virus brainiacs. Perhaps I'm wrong, but be very wary of these mails.

Further information about BBB-related scams is available on their web site.

UPDATE: McAfee WebImmune tells me that this is a new detection of the spy-agent.cf SpyWare which steals information about your web surfing.

UPDATE: A scan using VirusTotal shows that very few anti-virus programs are detecting this (although their version of Kaspersky is finding it---curious that the online Kaspersky scanner does not).


Thursday, October 18, 2007

Times Square: a fun spammer GIF

Nick FitzGerald reported a neat spammer image trick to me the other day. It's now entered in The Spammers' Compendium and involves using animation to display the word Viagra, emulating a flashing neon sign.

Since many OCR systems merge the layers together before OCR, the layers of this image are actually in the 'wrong' order. Once merged, the letters are in the order VIRAAG.


Thursday, October 11, 2007

More spammer crossword creativity

Nick FitzGerald writes in with a variant of the "1 across, 3 down" spammer content trick which looks like this:



The neat thing is that the crossword is created using HTML in a way that prevents a simple HTML-stripping spam filter from reading the brand names. To a simple spam filter this looks like:

CA
BREIT
OM
R
O
L
E
TIER
ING
GA
X

The actual HTML (cleaned up by Nick) is:

<TABLE>
<TR>
<TD>
<DIV align=right>
CA<BR>
<BR>
BREIT<BR>
OM
</DIV>
</TD>
<TD>
<DIV align=center>
R<BR>
O<BR>
L<BR>
E
</DIV>
</TD>
<TD>
TIER<BR>
<BR>
ING<BR>
GA
</TD>
</TR>
<TR>
<TD>
<DIV align=center>
X
</DIV>
</TD>
</TR>
</TABLE>
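
To see what a simple filter ends up with, here is a small Python sketch of a naive HTML stripper run over a compacted copy of the table above. The TextOnly class and strip_html function are illustrative names, not part of any real spam filter; the point is only that the text comes out column by column, so no brand name ever appears as a contiguous string.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect the text between tags, ignoring all markup and layout."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def strip_html(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Same structure as the table above, with the whitespace removed.
CROSSWORD = (
    "<TABLE><TR>"
    "<TD><DIV align=right>CA<BR><BR>BREIT<BR>OM</DIV></TD>"
    "<TD><DIV align=center>R<BR>O<BR>L<BR>E</DIV></TD>"
    "<TD>TIER<BR><BR>ING<BR>GA</TD>"
    "</TR><TR>"
    "<TD><DIV align=center>X</DIV></TD>"
    "</TR></TABLE>"
)

chunks = strip_html(CROSSWORD)
print(chunks)
```

A filter would have to lay the table out the way a browser does (rows across, columns down) before the brand names become readable.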


Friday, August 31, 2007

Useful sources of messages for testing spam filters

- Enron Corpus

- PU corpus

- SpamAssassin public corpus

- TREC 2005 Public Spam Corpus

- TREC 2006 Spam Track Public Corpora

- 20 Newsgroups


Saturday, July 14, 2007

Last push for POPFile voting

POPFile has been nominated for a SourceForge Community Choice Award due to the efforts of many people.

Voting closes on July 20 and there's strong competition in POPFile's category from the likes of Pidgin (formerly GAIM) and phpBB.

If POPFile wins SourceForge will be making a donation to a charity that I picked: Doctors without Borders.

If you feel like voting for POPFile, please vote in the Best Project for Communications category here:

http://sourceforge.net/awards/cca/vote.php


Tuesday, July 03, 2007

Please vote for POPFile

POPFile has been nominated for a SourceForge Community Choice Award through the efforts of its users.

Now it's time to vote.

If you love POPFile, please vote for it in the Best Project for Communications category.


Tuesday, June 26, 2007

Pretty Darn Fancy: Even More Fancy Spam

Looks like the PDF wave of spam is proceeding with a totally different tack. Sorin Mustaca sent me a PDF file that was attached to a pump and dump spam. Unlike the previous incarnation of PDF spams, this one looks a lot like a 'classic' image-spam.



It's some text (which has been misaligned to fool OCR systems), but there's a little twist. Zooming in on a portion of the spam shows that the letters are actually made up of many different colors (a look inside the PDF shows that it's actually an image).



I assume that the colors and the font misalignment are all there to make it difficult to OCR the text (and wrapping it in a PDF will slow down some filters).

Extracting the image and passing it through gocr gave the following result:

_ Ti_ I_ F_ _ Clih! %
ogx.

%_ % _c. (sR%)
t0.42 N %x

g0 _j_ __ h_ climb mis _k
d% % %g g nle_ Frj%.
_irgs__Dw_s _ rei_ � a
f%%d __. n_ia une jg gtill
oo_i_. Ib _ _ _ _ _ _ 90I
Tgdyl

And a run through tesseract resulted in:

9{Eg Takes I:-west.tY`S Far Gawd Climb! UP
fatued Sta=:khlauih. This we 1: still

Close, but no cigar.


Thursday, June 21, 2007

Pretty Darn Fancy: Stock spammers using PDF files

Joe Chongq sent me a fascinating spam that is the first I've seen that's using a PDF file to send its information. I've long predicted that we'll see a wave of complex formats used for spam as broadband penetration increases and sending large spams becomes possible.

This particular spam has a couple of interesting attributes:

1. The PDF file itself is a really nicely formatted report about a particular stock that's being pumped'n'dumped.

2. The file name of the PDF was chosen to entice the user further by using their first name. In this case it was called joe_report.pdf.

3. The PDF is compressed using the standard PDF 'Flate' algorithm and totals 84,398 bytes. That's fairly large, but we've certainly seen image spams that were larger. Use of compression here means that a spam filter that's not aware of PDF formats would be unable to read the message content.

Here's what the actual PDF looks like (click for a larger view):



Wednesday, June 20, 2007

jeaig (jgc's email address image generator) launches

Today is launch day for a simple little web site called jgc's email address image generator. It's a web service that enables anyone to generate a CAPTCHA-like image containing their email address for insertion on their web site.

Since web crawlers are currently unable to look inside images to scrape email addresses, this means that your email address will not be scraped from a web site, but can be written down by a human.

Here's an example email address:


Made using jeaig

and the code that generates that is:

<img src="http://jeaig.org/image/UmFuZG9tSVacujftR-zi3H6s63Vd-
6iIaBfIo-zwZhTmeKRTo4CL818NTBYEAUT4sFRAFiQhbkzU*tHehFDLYniqtHVc3CTr.png">
<br />
<font size="-2">Made using <a href="http://jeaig.org/">jeaig</a></font>

The server does not store the email address: when a user enters an email address on the site it is padded with a random amount of random data (both before and after the address) with randomness being supplied by /dev/urandom. Then the address is encrypted using Blowfish using a secret key known only to the jeaig server (this key was also generated from /dev/urandom) and a random IV is chosen.

The encrypted data is then encoded using a modified base-64 encoding so that it can be used in the URL.

When the image is requested using the special base-64 filename the email address is decrypted and then rendered using CAPTCHA code to produce the image. The image changes each time the email address is loaded.
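
The padding and encoding steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the jeaig code: the length-prefixed layout used to find the address again is hypothetical, the choice of '-' and '*' as the two replacement base-64 characters is a guess based on the sample URL, and the Blowfish encryption that sits between the two steps is left out because it is not in the Python standard library.

```python
import base64
import os


def pad_address(address):
    """Wrap the address in random-length random padding.

    Hypothetical layout: [prefix length][prefix][address length][address][suffix]
    """
    raw = address.encode("utf-8")
    if len(raw) > 255:
        raise ValueError("address too long for a one-byte length field")
    prefix = os.urandom(os.urandom(1)[0] % 16)  # 0 to 15 random bytes
    suffix = os.urandom(os.urandom(1)[0] % 16)  # 0 to 15 random bytes
    return bytes([len(prefix)]) + prefix + bytes([len(raw)]) + raw + suffix


def unpad_address(blob):
    """Recover the address from the layout produced by pad_address."""
    start = 1 + blob[0]
    length = blob[start]
    return blob[start + 1:start + 1 + length].decode("utf-8")


def to_url_token(data):
    """Base-64 with '+' and '/' replaced (assumed: '-' and '*'), no '=' padding."""
    return base64.b64encode(data, altchars=b"-*").decode("ascii").rstrip("=")


def from_url_token(token):
    padded = token + "=" * (-len(token) % 4)
    return base64.b64decode(padded, altchars=b"-*")


# The encryption step would go between pad_address and to_url_token.
token = to_url_token(pad_address("user@example.com"))
print(token)
print(unpad_address(from_url_token(token)))
```

Because the padding is random, the same address produces a different token each time, even before encryption is applied.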

And, yes, this is a free service.


Monday, June 18, 2007

POPFile v0.22.5 Released

Here are the details:

Welcome to POPFile v0.22.5

This version is a bug fix and minor feature release that's been over a
year in the making (mostly due to me being preoccupied by other
things).


NOMINATING POPFILE FOR AN AWARD

SourceForge has announced their 'Community Choice Awards' for 2007 and
is looking for nominations. If you feel that POPFile deserves such an
honour please visit the following link and nominate POPFile in the
'Best Project for Communications' category. POPFile requires multiple
nominations (i.e. as many people as possible) to get into the list of
finalists.

http://sourceforge.net/awards/cca/nomination.php?group_id=63137

Thanks!


WHAT'S CHANGED SINCE v0.22.4

1. POPFile now defaults to using SQLite2 (the Windows installer will
convert existing installations to use SQLite2).

2. Various improvements to the handling of Japanese messages and
improvements for the 'Nihongo' environment:

Performance enhancement for converting character encoding by
skipping conversion which was not needed. Fix a bug where the
wrong character set was sometimes used when the charset was not
defined in the mail header. Fix a bug where several HTML
entities caused misclassification. Avoid a warning 'uninitialized
value'. Fix a bug that the word's links in the bucket's detail
page were not url-encoded.

3. Single Message View now has a link to 'download' the message from
the message history folder.

4. Log file now indicates when an SSL connection is being made to the
mail server.

5. A number of small bug fixes to the POPFile IMAP interface.

6. Installer improvements:

Email reconfiguration no longer assumes Outlook Express is
present. Add/Remove Programs entries for POPFile now show a more
realistic estimate of the program and data size. Better support
for proxies when downloading the SSL Support files. The SSL
patches are no longer 'hard-coded', they are downloaded at
install time. This will make it easier to respond to future
changes to the SSL Support files. The Message Capture Utility
now has a Start Menu shortcut to make it easier to use the
utility. The minimal Perl has been updated to the latest
version. The installer package has been updated to make it work
better on Windows Vista (but further improvements are still
required).


WHERE TO DOWNLOAD

http://getpopfile.org/wiki/download


GETTING STARTED WITH POPFILE

An introduction to installing and using POPFile can be found in the
QuickStart guide:

http://getpopfile.org/wiki/QuickStart


SSL SUPPORT IN WINDOWS

SSL Support is offered as one of the optional components by the
installer. If the SSL Support option is selected the installer will
download the necessary files during installation.

If SSL support is not selected when installing (or upgrading) POPFile
or if the installer was unable to download all of the SSL files then
the command

setup.exe /SSL

can be used to run the installer again in a special mode which will
only add SSL support to an existing installation.


CROSS PLATFORM VERSION KNOWN ISSUES

The current version of SQLite (v3.x) is not compatible with POPFile.
You must use DBD::SQLite2 to access the database.

Users of SSL on non-Windows platforms should NOT use IO::Socket::SSL
v0.97 or v0.99. They are known to be incompatible with POPFile; v1.07
is the most recent release of IO::Socket::SSL that works correctly.


v0.22.0 RELEASE NOTES

If you are upgrading from pre-v0.22.0 please read the v0.22.0 release
notes for much more information:

http://getpopfile.org/wiki/ReleaseNotes


DONATIONS

Thank you to everyone who has clicked the Donate! button and donated
their hard earned cash to me in support of POPFile. Thank you also to
the people who have contributed their time through patches, feature
requests, bug reports, user support and translations.

http://sourceforge.net/forum/forum.php?forum_id=213876


THANKS

Big thanks to all who've contributed to POPFile over the last year.

John.


Tuesday, June 05, 2007

A little light relief in an 'enlargement' spam

Jason Steer from IronPort sent me over an image taken from a recent spam for an 'enlargement' product. Here's the image:



Since the text is large I thought it would be fun to run this through gocr to OCR out the text and URL. Here's the output of gocr on the image (I removed a few blank lines for clarity here):

(PICTURE)
__%[0______
__ ___


______
_ ____

So, that worked out well :-) Nevertheless, the domain is listed in the SURBL:

$ dig relies.net.multi.surbl.org
;; QUESTION SECTION:
;relies.net.multi.surbl.org. IN A

;; ANSWER SECTION:
relies.net.multi.surbl.org. 2100 IN A 127.0.0.4

So, if you can extract the domain name from the image it's possible to check it against the SURBL and blacklist the message. Switching over to Google's Tesseract OCR system revealed the following:
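
The lookup itself is easy to script. Here is a minimal Python sketch of the same check that dig performs above; surbl_query_name and is_listed are illustrative names, and treating any 127.x answer as "listed" is a simplification (the individual bits of the answer carry more detail that this sketch ignores).

```python
import socket


def surbl_query_name(domain, zone="multi.surbl.org"):
    """Build the DNS name to query: the domain followed by the SURBL zone."""
    return domain.strip(".").lower() + "." + zone


def is_listed(domain, resolve=socket.gethostbyname):
    """Return True if the lookup answers with a 127.x address.

    The resolver is a parameter so the function can be tried without network
    access; by default it performs a real DNS lookup.
    """
    try:
        answer = resolve(surbl_query_name(domain))
    except socket.gaierror:
        # No such name: the domain is not in the list.
        return False
    return answer.startswith("127.")


print(surbl_query_name("relies.net"))
```

Calling is_listed("relies.net") with the default resolver issues the same query as the dig command shown above.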

(I LIT inl 3
IE\)' :
Q ii ,g @1

Lgiiiizj
i
H ; ik
s$i` wg `i?! J


Friday, May 25, 2007

Back from the EU Spam Symposium; here's my talk

So I'm back home from the 2007 EU Spam Symposium which was held in Vienna in Austria and you can grab my presentation here. You'll notice that the presentation template is from MailChannels. They very kindly sponsored my trip to Vienna and so I did a little publicity for them. There's only one slide, however, that's actually anything to do with MailChannels in the entire presentation, so don't expect a product pitch!

One thing I didn't mention in my talk was that as the number of Internet hosts expands and the number of broadband subscribers grows the number of competing botnets can also grow. That means I'd expect to see the price of botnet rental dropping as the Internet grows leading to lower costs for spammers.

I'll give a complete round up of the conference in my newsletter next week, but overall there were some interesting talks, and meeting some people like Richard Cox from SpamHaus and Richard Clayton was very useful.


Tuesday, May 15, 2007

Some architectural details of Signal Spam

Finally, Signal Spam, France's new national anti-spam system, launched and I'm able to talk about it. For a brief introduction in English start here.

I'm not responsible for the idea behind Signal Spam, nor for its organization, but I did write almost all the code used to run the site and the back end system. This blog post talks a little bit about the design of Signal Spam.

Signal Spam lets people send spams via either a web form, or a plug-in. Plug-ins are currently available for Outlook 2003, Outlook 2007 and Thunderbird 2.0; more coming. Currently Signal Spam does three things with every message: it keeps a copy in a database after having extracted information from the body and headers of the message; it figures out if the message came from an ISP in France and if so sends an automatic message to the ISP indicating that they've got a spammer or zombie in their network; it figures out if the message was actually a legitimate e-marketing message from a French mailer and informs the person reporting the spam of how to unsubscribe.

The original plan was that the system be capable of handling 1,000,000 messages per day allowing for peaks of up to 1000 messages per minute (such as when people first come to work in the morning) and that messages would be handled in near real-time (i.e. the time from a message being received by the system to it being analyzed and forwarded to an ISP would be under 60 seconds). Signal Spam also wanted a lot of flexibility in being able to scale the system up as use of the site grew and being able to do maintenance of the site without taking it down.

Here's the last 12 hours of activity on the site, which pretty much matches what we expected with a peak once people get to work and start reading their mail. (These charts are produced automatically by the lovely RRDTool.)



The system I built is split into two parts: the front end (everything that the general public sees including the API used by the plug-ins) and the back end (the actual database storing the messages sent, the software that does analysis and the administration interface). Communication between the front end and the back end uses a web service running over HTTPS.

To make things scale easily the back end is entirely organized around a simple queuing system. When a message arrives from a user it is immediately stored in the database (there are, in fact, two tables: one table contains the actual complete message as a BLOB and the other contains the fields extracted from the message. The messages have the same ID in each table and the non-BLOB table is used for searching and sorting).

Once stored in the database the message ID is added to a FIFO queue (which is actually implemented as a database table). An arbitrary number of processes handle message analysis by dequeuing IDs from the FIFO queue (using row-level locking so that only one process gets each ID). Once dequeued the message is analyzed: fields such as From, Subject, Date are extracted and stored in the database, the Received headers are walked using a combination of blacklist lookup and forgery detection to find the true IP address that injected the message into the Internet, the IP address is mapped to the network that manages the IP address, fingerprints of the message are taken and all URLs inside the message are extracted.

Once the analysis is complete the process decides whether the message needs to be sent to an ISP. If so it enqueues the message ID on another FIFO queue for a separate forwarding process to handle. If the message is in fact a legitimate message then the message ID is enqueued on a FIFO queue for another response process to handle.

The forwarding process generates an ARF message to the appropriate ISP and sends the message for every ID that it dequeues from the queue, using VERP for bounce or response handling.

The response process dequeues IDs and responds to the original reporter of the spam with a message tailored for the specific e-marketer with full unsubscribe details.

The use of queues and a shared database to handle the queues, plus a simple locking strategy, means that arbitrary numbers of processes can be added to handle the load on the system as required (currently there is only one process of each type running and handling all messages within the required time). It also means that the processes do not need to be on the same machine and the system can scale by adding processes or adding hardware.

Stopping the processes does not stop the front end from operating. Messages will still be added to the database and the analysis queue will grow. In fact, the length of the queue makes measuring the health of the system trivial: just look at the length of the queue to see if we are keeping up or not.

Since the queue has the knowledge about the work to be done, processes can be stopped and upgraded as needed without taking the system off line.
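
The queue-in-a-table idea can be sketched in Python. This is an illustration only and swaps in a different technique for the locking: it uses SQLite, which takes a write lock on the whole database inside a BEGIN IMMEDIATE transaction, instead of the MySQL row-level locking described above. The TableQueue class and its table layout are hypothetical.

```python
import sqlite3


class TableQueue:
    """A named FIFO queue of message IDs stored in a database table."""

    def __init__(self, conn, name):
        self.conn = conn
        self.name = name
        conn.execute(
            "CREATE TABLE IF NOT EXISTS queue ("
            "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
            "name TEXT NOT NULL, message_id INTEGER NOT NULL)"
        )

    def enqueue(self, message_id):
        self.conn.execute(
            "INSERT INTO queue (name, message_id) VALUES (?, ?)",
            (self.name, message_id),
        )

    def dequeue(self):
        """Atomically claim the oldest ID, or return None if the queue is empty."""
        self.conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
        try:
            row = self.conn.execute(
                "SELECT seq, message_id FROM queue WHERE name = ? "
                "ORDER BY seq LIMIT 1",
                (self.name,),
            ).fetchone()
            if row is not None:
                self.conn.execute("DELETE FROM queue WHERE seq = ?", (row[0],))
            self.conn.execute("COMMIT")
        except Exception:
            self.conn.execute("ROLLBACK")
            raise
        return None if row is None else row[1]

    def __len__(self):
        """Queue length, usable as a simple health measure."""
        return self.conn.execute(
            "SELECT COUNT(*) FROM queue WHERE name = ?", (self.name,)
        ).fetchone()[0]


# isolation_level=None puts the connection in autocommit mode so that the
# explicit BEGIN/COMMIT statements above control the transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)
analysis = TableQueue(conn, "analysis")
analysis.enqueue(101)
analysis.enqueue(102)
print(len(analysis), analysis.dequeue(), len(analysis))
```

Because select and delete happen inside one transaction, two workers can never claim the same ID, and a worker that is stopped simply leaves the remaining IDs in the table for the next one.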

To hide all this the entire system (which is written in Perl---in fact, the back end is entirely LAMP) uses an object structure. For example, creating the Message object (passing the raw message into the constructor) performs the initial message analysis and queues the message for further analysis. Access to the database queues is entirely wrapped in a Queue object (constructor takes the queue name). These objects are dynamically loaded by Perl and can be upgraded as needed.

Finally, all the objects (and related scripts) have unit tests using Perl's Test::Class and the entire system can be tested with a quick 'make test'. One complexity is that most of the classes require access to the database. To work around this I have created a Test::Database class that is capable of setting up a complete MySQL system from scratch (assuming MySQL is currently installed) and loading the right schema, that is totally independent of any other MySQL instance. The class then returns a handle (DBI) to that instance plus a connect string. This means the unit tests are totally independent of having a running database.

In addition, the unit tests include a system that I created for POPFile which allows me to get line-by-line coverage information showing what's tested and what's not. By running 'make coverage' it's possible to run the test suite with coverage information. This gives percentage of lines tested and for every Perl module, class and script a corresponding HTML file is generated with lines colored green (if they were executed during testing) or red (if not). The coloring is achieved by hooking the Perl debugger (see this module from POPFile for details).
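
Python exposes a similar hook, sys.settrace, so the general idea can be sketched there. This is an illustration of the technique of recording executed lines through a tracing hook, not the POPFile module; run_with_line_coverage and connect_or_fail are made-up names, and line numbers are reported relative to the start of the traced function.

```python
import sys


def run_with_line_coverage(func, *args):
    """Call func and return (result, set of executed line offsets).

    Offsets are relative to the function's 'def' line, so offset 1 is the
    first line of the body.
    """
    code = func.__code__
    executed = set()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace other functions
        if event == "line":
            executed.add(frame.f_lineno - code.co_firstlineno)
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # always restore whatever hook was there
    return result, executed


def connect_or_fail(ok):
    if not ok:
        return "failed"
    return "connected"


result, lines = run_with_line_coverage(connect_or_fail, True)
print(result, sorted(lines))
```

A report generator would then color each source line according to whether its offset appears in the executed set; here the failure branch (offset 2) is never reached when the call succeeds.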

Here's an example of what that looks like (here I'm showing Classifier/Bayes.html which corresponds to Classifier/Bayes.pm in the POPFile code, but it looks the same in Signal Spam):



The green lines (with line numbers added) were executed by the test suite; the red line was not (you can see here that my test suite for POPFile didn't test the possibility that the database connect would fail).


Perhaps OCRing image spams really is working?

I've previously been skeptical of the idea that OCRing image spams was a worthwhile effort because of the variety of image-obfuscation techniques that spammers had taken to using.

But Nick FitzGerald has recently sent me an example of an image spam that seems to indicate that spammers are concerned about the effectiveness of OCR. Here's the image:



What's striking is that the spammer has used the same content-obscuring tricks that we've seen with text (e.g. Viagra has become Vi@gr@), perhaps out of fear that the OCRing of images is working and revealing the text within the images.

Or perhaps this spammer is just really paranoid.


Tuesday, April 17, 2007

No newsletter for April 15, 2007

This blog post is really for people who read my spam and anti-spam newsletter. There won't be one this week (it was due on April 15) because I wanted to spend time in the newsletter reviewing the MIT Spam Conference 2007.

Unfortunately, the videos of the presentations on the web site are either of very poor quality or have no sound at all. This makes reviewing the presentations very hard. Bill Yerazunis has promised better videos 'coming soon', and I'm waiting for them.

While I'm complaining, I also find the 'download an ISO' approach to getting the papers ridiculous. I've asked Bill to provide individual links to each of the papers and presentations so that people can just click and download what they want. He says...

We *could* have individual links. I assume the
individual authors would do so in any case.

And I'll probably do that later.

But right now, I'm sorta trying to get people to
at least glance at all the papers, and put the CDROMs into
their local libraries; that's the motivation.

So, my next newsletter will be out on April 30 with a review of the MIT Spam Conference and other regular news.

Labels:

Thursday, March 15, 2007

Calibrating a machine learning-based spam filter

I've been reading up about calibration of text classifiers, and I recommend a few papers to get you started:

The overall idea is that the scores output by a classifier need to be calibrated so that they can be understood. And, specifically, if you want to understand them as a probability that a document falls into a particular category then calibration gives you a way to estimate the probability from the scores.

The simplest technique is bucketing or binning. Suppose that the classifier outputs a score s(x) in the range [0,1) for an input document x. Once a classifier is trained it's possible to calibrate it by classifying known documents, recording the output scores and then, for each of a set of ranges (e.g. divide [0,1) into 10 ranges [0.0,0.1), [0.1,0.2) and so on), counting the number of documents in each class to get a fraction that represents the probability within that range. Then when you classify an unknown document you look at the range in which its score falls to get probability estimates.

For spam filtering each range would consist of the number of emails that are ham in the range, and the number that are spam. The ham probability then just comes out to # of ham emails in the range / total number of emails in the range.
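
As a minimal sketch of the binning idea (the function names and the tiny training set are made up for illustration, and a real calibration run would use thousands of messages):

```python
from collections import defaultdict

def calibrate(scored_docs, n_bins=10):
    """Build a bin -> [ham count, spam count] table from labelled scores.

    scored_docs is an iterable of (score, label) pairs, with score in
    [0, 1) and label either "ham" or "spam".
    """
    counts = defaultdict(lambda: [0, 0])
    for score, label in scored_docs:
        b = min(int(score * n_bins), n_bins - 1)
        counts[b][0 if label == "ham" else 1] += 1
    return counts

def ham_probability(counts, score, n_bins=10):
    """Estimate P(ham) for a new score, or None if its bin has no data."""
    b = min(int(score * n_bins), n_bins - 1)
    ham, spam = counts.get(b, (0, 0))
    total = ham + spam
    return ham / total if total else None

# Illustrative data only: three messages land in [0.1,0.2), one in [0.9,1.0).
table = calibrate([(0.12, "spam"), (0.15, "spam"), (0.18, "ham"), (0.95, "ham")])
print(ham_probability(table, 0.11))  # one ham out of three messages in the bin
print(ham_probability(table, 0.50))  # None: no calibration data for this bin
```

Returning None for an empty bin makes the lack of calibration data explicit instead of inventing a probability.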

I decided to run a little test of this and took the complete TREC 2005 Public Spam Corpus and trained a Gary Robinson style spam filter on it. Then I classified the entire corpus to get calibration data. Instead of looking at the final scores, I looked at the pair of scores (one for ham, one for spam) that the classifier generates and used those to generate bins. The following chart shows (crudely) the percentage of spams and hams in each bin:



The x-axis is the ham score in the range [0.0,0.9) with each square representing a single 0.1-width bin. The left-most square means that the classifier gave a very low score in terms of hamminess, and the right-most square means that it gave a very high score.

The y-axis is the spam score in the range [0.0,0.9), so that the bottom means a low spamminess score and the top a high one. Each square is colored red and green. Red is the percentage of messages in that bin that are spam, and green the percentage of messages in that bin that are ham.

So, as you'd expect, the top left corner (low ham score, high spam score) is generally colored red, indicating that these are spam messages, and the bottom right corner (low spam score, high ham score) is colored green (lots of ham). Some squares are black because there's too little data (I arbitrarily said that if there were fewer than 5 messages in the bin it was colored black).

But there's a really interesting square in the top right: it's almost all red (in fact, it's 93% red or spam). So we can have confidence that a message there (which corresponds to a final 'score' of 0.5) is in fact spam.

I'm sure that there are other interesting insights to be gained from calibration and I'd like to spend the time to evaluate the effectiveness of using the probabilities generated in this way (instead of the rather arbitrary selection of a score 'cutoff' point) as the way to determine whether a message is spam or ham (or undetermined). For example, the square at (0.9,0.2) (which corresponds to a 'score' of 0.18) is 57% ham, 43% spam, so it looks like a good candidate for undetermined; it looks like a much better candidate than the typical "area around 0.5" (which corresponds to (0.9,0.9) and is 93% spam).
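
Here is a rough sketch of how the two-score bins could drive the spam/ham/undetermined decision. The 0.9 confidence threshold, the function names and the message counts are all assumptions chosen for illustration (the counts simply reproduce the 93% and 57%/43% figures mentioned above); the 5-message minimum is the one used for the black squares.

```python
from collections import defaultdict

MIN_MESSAGES = 5  # bins with fewer messages are treated as "no data"

def bin_of(ham_score, spam_score, n_bins=10):
    """Map a (ham score, spam score) pair to a 2-D bin index."""
    def index(score):
        return min(int(score * n_bins), n_bins - 1)
    return index(ham_score), index(spam_score)

def build_grid(labelled):
    """labelled: iterable of (ham_score, spam_score, is_spam) triples."""
    grid = defaultdict(lambda: [0, 0])  # bin -> [ham count, spam count]
    for ham_score, spam_score, is_spam in labelled:
        grid[bin_of(ham_score, spam_score)][1 if is_spam else 0] += 1
    return grid

def classify(grid, ham_score, spam_score, confident=0.9):
    """Return 'spam', 'ham' or 'undetermined' using the calibrated bin."""
    ham, spam = grid.get(bin_of(ham_score, spam_score), (0, 0))
    total = ham + spam
    if total < MIN_MESSAGES:
        return "undetermined"
    p_spam = spam / total
    if p_spam >= confident:
        return "spam"
    if p_spam <= 1 - confident:
        return "ham"
    return "undetermined"

# Made-up calibration data shaped like the squares discussed above.
training = (
    [(0.95, 0.95, True)] * 93 + [(0.95, 0.95, False)] * 7 +   # 93% spam
    [(0.95, 0.25, False)] * 57 + [(0.95, 0.25, True)] * 43 +  # 57% ham
    [(0.95, 0.05, False)] * 10                                 # all ham
)
grid = build_grid(training)
print(classify(grid, 0.93, 0.97))  # top right square
print(classify(grid, 0.91, 0.22))  # the (0.9,0.2) square
```

With these assumptions the top right square is classified as spam even though its combined score is 0.5, while the (0.9,0.2) square comes out as undetermined.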

Labels:

An image-based spam that brags about its own delivery

Nick FitzGerald sent me a simple image-based spam for Viagra which starts with a brag on the part of the spammer:



Nick calls it a 'self aware' spam.

Labels:

Wednesday, February 28, 2007

Image spammers doing the twist

It's been quite a while since I last blogged about ever-changing image spam. Anna Vlasova wakens me from my unblogging slumber with some great samples of recent image spams where the spammer has decided to rotate the entire image to try to avoid detection. Take a look at this first one:

The spammer has really gone to town here:
  • There's random speckling all over the images to upset hashing and OCR techniques
  • There's no URL in the message itself (it's in the image)
  • The entire image has been rotated to the left to obscure the text

And, of course, they are not going to be content with just one rotation and can randomize the angle per message:

And they've gone even further by slicing the image up, randomizing the angle and overlaying the elements using animation.

Labels:

Wednesday, February 07, 2007

Trusted Email Connection Signing (rev 0.2)

IMPORTANT: This blog post deprecates my previous posting on this subject. The blog post Proposal for connection signing reputation system for email is deprecated.


Sign the medium, not the message



The motivation behind TECS (Trusted Email Connection Signing) is that what managers of MX servers on the public Internet really care about is the ability to distinguish a good connection (coming from a legitimate sender and which will be used to send wanted email) from a bad connection (coming from a spammer). If you can identify a bad connection (today, you do that using an RBL or other reputation service based on the IP address of the sender) you can tarpit or drop it, or subject the mails sent on the connection to extra scrutiny. If you can identify a good connection it can bypass spam checks and help reduce the overall false positive rate.

If you are a legitimate bulk mailer (an email marketer, for example) then you care deeply that your reputation is recognizable and that mail sent by you is delivered. Currently, you have to carefully tend your IP addresses to make sure that they don't appear on blacklists, and you have to ensure that new IP addresses are clean.

If you are running a large email service (e.g. Yahoo! Mail) then you are currently trying to build white and blacklists of IP addresses, when what you really want are white and black lists of entities.

Currently, the options used to identify a bad connection are rather limited (RBLs, paid reputation services and grey listing), and good connections are hard to manage (whitelists on a per-recipient basis, or pay-per-mail services). What's needed is a different approach.

The idea is to identify and determine the reputation of the entity connecting to a mail server in real-time without resorting to a blacklist or whitelist. This is done by signing the connection itself. With the signature on a per-connection basis a mail server is able to determine who is responsible for the connection, and then look up that entity's reputation in a database.

Current reputation databases are based on IP addresses. This is a very inflexible system: IP addresses must be added to blacklists very fast as spammers churn through zombie machines, and any legitimate emailer needs to make sure their mail servers are whitelisted with multiple email providers (e.g. Yahoo!, Gmail, Brightmail, ...) to ensure delivery. And if a legitimate mailer wants to bring new servers online, with new IP addresses, they have to run through the entire whitelisting process again.

This is inefficient. The mapping between IP addresses and entities (e.g. knowing that Google's Gmail service uses a specific set of IP addresses) is unwieldy to manage and the wrong level of granularity. Google should be free to add and remove email servers at will, while carrying their good reputation with them.

That's what TECS gives you.

Connection Signing

TECS is an extension to the existing SMTP AUTH mechanism (see RFC 2554) and implements an authentication mechanism that I'll refer to as TECS-1 (the 1 here acts as a version number on the protocol). TECS-1 would need to be registered as a SASL (see RFC 2222) authentication mechanism.

When a mail sender connects to an SMTP server wishing to sign its connection it issues the EHLO command and if that SMTP server is capable of handling AUTH the mail sender then signs the connection using the AUTH command, with the TECS-1 mechanism followed by an initial response (which contains the TECS signature) as defined in RFC 2554.

Here's an example session:

S: 220 smtp.example.com ESMTP server ready
C: EHLO jgc.example.com
S: 250-smtp.example.com
S: 250 AUTH CRAM-MD5 DIGEST-MD5 TECS-1
C: AUTH TECS-1 PENCeUxFREJoU0Nnbmh1YWVlMDljNDBhZjJiODRhMGMyYjNiYmFlNzg2ZQ==
S: 235 Authentication successful.

The 'initial response' section of the AUTH command is a base-64 encoded string containing the following structure (this is deliberately similar to the DKIM fields):

a=rsa-sha256; q=dns; d=jgc.org; b=oU0Nnbmh1YWVlMDljNDBhZjJiO==

a= is the cryptographic method used (default would be RSA/SHA-256 with suitable padding as described in PKCS#1 version 1.5; see RFC 3447).

d= is the name of the domain signing the connection. In the example above I am showing a connection that is being signed (and hence claimed by) jgc.org.

q= is a query type with the default being the use of a DNS TXT record. This query method is used to obtain the public key associated with the signing domain. The public key would be obtained by looking up _tecs.jgc.org and getting the associated TXT record.

b= is the binary signature for the connection generated using the method in a= by the d= domain.
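
To make the field layout concrete, here is a small sketch of building and parsing the initial response. The function names are made up, and the choice to carry the b= value as already-encoded text (as in the example line above) is an assumption; fetching the TXT record itself is left out because it needs a DNS library.

```python
import base64

def encode_initial_response(domain, signature_text,
                            algorithm="rsa-sha256", query="dns"):
    """Pack the TECS fields and base-64 encode them for the AUTH command."""
    fields = f"a={algorithm}; q={query}; d={domain}; b={signature_text}"
    return base64.b64encode(fields.encode("ascii")).decode("ascii")

def decode_initial_response(initial_response):
    """Reverse of the above: return the fields as a dict keyed by tag."""
    text = base64.b64decode(initial_response).decode("ascii")
    fields = {}
    for part in text.split(";"):
        part = part.strip()
        if part:
            # Split on the first '=' only: the b= value may end in '='.
            tag, _, value = part.partition("=")
            fields[tag] = value
    return fields

def key_record_name(fields):
    """DNS name whose TXT record would hold the signer's public key."""
    return "_tecs." + fields["d"]

blob = encode_initial_response("jgc.org", "oU0Nnbmh1YWVlMDljNDBhZjJiO==")
parsed = decode_initial_response(blob)
print("AUTH TECS-1 " + blob)
print(key_record_name(parsed))
```

The receiving server would look up the name returned by key_record_name (for the q=dns query type) and use the key it finds to check the b= value.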

The connecting server signs the tuple consisting of ( destination IP/port, source IP/port and epoch ); that way they sign the current connection and verify that they are responsible for the mail sent across it.

Each entity has an RSA public/private key pair. When signing a connection the entity generates a SHA-256 hash of the tuple. The destination IP/port pair is the IP address and port on the mail server that the mail sender is currently connected to; similarly the source IP/port pair is the IP address and port of the connection being used by the mail sender. The epoch is the standard Unix epoch rounded to the nearest 30 seconds.

The entity making the connection then encrypts the hash with their private key.
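
A minimal sketch of the hashing step follows. The post does not define how the tuple is serialized into bytes, so the "ip:port|ip:port|epoch" layout used here is purely an assumption, as are the function names and the example addresses. The final RSA step with the private key is omitted because it needs a cryptography library outside the Python standard library.

```python
import hashlib
import time

def rounded_epoch(now=None, window=30):
    """Unix time rounded to the nearest multiple of `window` seconds."""
    if now is None:
        now = time.time()
    return int((now + window / 2) // window) * window

def connection_digest(dst_ip, dst_port, src_ip, src_port, epoch):
    """SHA-256 over the connection tuple (assumed serialization)."""
    material = f"{dst_ip}:{dst_port}|{src_ip}:{src_port}|{epoch}"
    return hashlib.sha256(material.encode("ascii")).digest()

# Example addresses are from the documentation ranges, not real servers.
epoch = rounded_epoch(1170892814)
digest = connection_digest("192.0.2.25", 25, "198.51.100.7", 40312, epoch)
print(epoch, digest.hex())
```

Because the source IP and port are part of the hashed material, the same signer connecting from a different source port within the same 30-second window produces a different digest.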

Update (February 8, 2007): A number of people have suggested getting the public key from the _domainkey (i.e. DKIM) label of the d= domain. This seems like a good idea since there's no need to reinvent the wheel.

Update (February 9, 2007). A few people pointed me in the direction of the MARID CSV proposal (see CSV). I've addressed this below.

I don't want to sign my outbound SMTP connections
Well don't then. If TECS were implemented you'd probably find you'd have trouble getting your mail delivered as non-signing (and hence not-taking-responsibility) would be looked upon very poorly.

I'm a big ISP and I don't want to sign as myself, can I sign as a customer?
Sure, for example, jgc.org's mail is actually handled by lists.herald.co.uk. When that server is sending mail for jgc.org it could sign as jgc.org as long as it has access to jgc.org's private key. It would simply specify d=jgc.org in the TECS data.

Why don't you just use STARTTLS with certificates?
Because that's a very heavyweight system, designed for something else. A SASL extension using SMTP AUTH is simple and clean.

Why do you think connection signing is useful?
Because SMTP server resources are precious. Being able to make a decision about a connection before any mail is delivered is very useful. An SMTP server owner could use reputation data from a public or private source to decide whether to accept or reject a connection, slow down a connection, apply little or extra scrutiny to a connection, etc. Being able to do this before receiving a ton of mail and tying up a server is very valuable.

Won't spammers just sign their connections?
No doubt, but that's hardly a worry: being able to identify the good senders fast is the most important goal.

Why don't you just sign the destination IP/port pair? That's known before the connection is made and avoids problems with NAT
TECS could just sign the tuple ( destination IP, port, epoch ) but I think it's a bit weaker than my proposal. Since the destination IP and port are fixed for a given MTA the signature is really a signature on the time. An eavesdropper could replay the signature within 30 seconds (or other timeout on the epoch value) and get an authenticated connection from any source IP address.

What about the MARID CSV proposal?
The CSV proposal is a lightweight (DNS-based), non-cryptographic method of establishing whether a host claiming a certain domain name in HELO/EHLO is authorized to be an SMTP client. Clearly, CSV aims to provide a simple method of determining whether a connecting SMTP client is authorized to be an SMTP client with the claimed name. This seems like a useful extension, but is very different from TECS. TECS operates at the level of a specific connection, and with an entity that is distinct from the domain of the SMTP client. This is valuable for two reasons: it allows the identity to be moved from SMTP service provider to provider, and it means that shared SMTP servers can operate claiming different 'responsible parties' for each connection. This latter point is important for ISPs that provide SMTP services to email marketers where the same SMTP server may be shared across many clients. This can result in a clean emailer being blacklisted because the IP of the shared server was blacklisted because of some other unrelated misbehaviour.

Whilst CSV is a useful extension which would help with the zombie problem, it does not address the needs at the connection level where I believe the problem needs to be addressed.

CSV also provides specific services for checking domain names against accreditation services. That is outside the scope of TECS, although the assumption is that such services would exist for TECS signed connections against the domain name claiming responsibility. The bottom line is that TECS deals with the party responsible for a connection, CSV the party responsible for the server.

What about mailing lists that forward mail?
By signing their connections they take responsibility for the mails they are sending. So mailing lists would need to have appropriate email policies in place for unsubscriptions, and deal themselves with spam to the list. Since the connection is signed, any concerns about munging of From: addresses for VERP handling, or adding headers/footers to email, are irrelevant.

Is this compatible with SPF, Sender-ID, DomainKeys?
They are orthogonal. There's no direct interaction. Although, it might be sensible to use the _domainkey record from DKIM to obtain a public key thus sharing the same key between DKIM and TECS.

Will this reduce spam?
I'm not going to make any predictions. The goal would be to build a database that makes it easier to recognize someone who is legitimate, and scrutinize those who abuse the system or who choose not to sign.

What about anonymity?
Anonymous remailers are unaffected. They could sign their outbound connections with the system but that would not affect any changes they make to anonymize messages since it's the connection, not the message content, that's signed.

What if I change the mail servers or IP addresses I am using?
There's no effect. Keep signing the connections and you can take responsibility for any IP address you want to.

I think you are wrong, right, stupid, a genius.
Please comment here, or write to me directly.


Many thanks to all members of the REDACTED discussion forum, and to Toby DiPasquale.

Labels:

Thursday, February 01, 2007

Proposal for connection signing reputation system for email: TECS

IMPORTANT: This blog post is deprecated. Please read Trusted Email Connection Signing (rev 0.2) instead


The motivation behind TECS (Trusted Email Connection Signing) is that what managers of MX servers on the public Internet really care about is the ability to distinguish a good connection (coming from a legitimate sender and which will be used to send wanted email) from a bad connection (coming from a spammer). If you can identify a bad connection (today, you do that using an RBL or other reputation service based on the IP address of the sender) you can tarpit or drop it, or subject the mails sent on the connection to extra scrutiny. If you can identify a good connection it can bypass spam checks and help reduce the overall false positive rate.

Currently, the options used to identify a bad connection are rather limited (RBLs, paid reputation services and grey listing), and good connections are hard to manage (whitelists on a per-recipient basis, or pay-per-mail services). What's needed is a different approach.

There are also ideas like SPF, Sender-ID and DomainKeys which all attack the problem of protecting the integrity of the From: portion of a message.

TECS is different. The idea is to identify and determine the reputation of the entity connecting to a mail server in real-time without resorting to a blacklist or whitelist. This is done by signing the connection itself. With the signature on a per-connection basis a mail server is able to determine who is responsible for the connection, and then look up that entity's reputation in a database.

Current reputation databases are based on IP addresses. This is a very inflexible system: IP addresses must be added to blacklists very fast as spammers churn through zombie machines, and any legitimate emailer needs to make sure their mail servers are whitelisted with multiple email providers (e.g. Yahoo!, Gmail, Brightmail, ...) to ensure delivery. And if a legitimate mailer wants to bring new servers online, with new IP addresses, they have to run through the entire whitelisting process again.

This is inefficient. The mapping between IP addresses and entities (e.g. knowing that Google's Gmail service uses a specific set of IP addresses) is unwieldy to manage and the wrong level of granularity. Google should be free to add and remove email servers at will, while carrying their good reputation with them.

That's what TECS gives you.

Now for the how. To work, TECS requires two things: a reputation authority and an algorithm. Let's start with the second.

Connection Signing

When a mail sender connects to an SMTP server wishing to sign its connection it issues the EHLO command and, if that SMTP server is capable, a new extension command, TECS, will be available. After the EHLO the mail sender then signs the connection using the TECS command.

The TECS command has two parts: an identifier (this is the unique identifier of the entity signing the connection, and thus taking responsibility for the messages sent across the connection) and a signature.

Each entity has an RSA public/private key pair. When signing a connection the entity generates a SHA-256 hash of the tuple ( destination IP/port, source IP/port, epoch ). The destination IP/port pair is the IP address and port on the mail server that the mail sender is currently connected to; similarly the source IP/port pair is the IP address and port of the connection being used by the mail sender. The epoch is the standard Unix epoch rounded to the nearest 30 seconds.

The entity making the connection then encrypts the hash with their private key, turns that into a hex string and uses that string as the second parameter to the new SMTP TECS command.

For example, an entity with the unique identifier 1b46ef3d might sign a particular connection like this:

TECS 1b46ef3d 5dde82a341863c87be1258c02ce7f80bf214192b

to which the receiving server could reply 200 OK if the signature is good (which they verify by generating the same hash and decrypting using the entity's public key), or with an error if the signature is bad (and they should probably drop the connection).

To get the entity's public key the receiving server needs to query the reputation authority.

Reputation Authority

The TECS reputation authority would be a non-profit organization that sells public/private key pairs and allocates entity IDs to verified entities. Money gathered from selling keys would be used to maintain the database of reputation information for each entity, and to ensure that only reputable entities can obtain keys.

In the example above the receiving server would query the DNS TXT record of the domain name produced by concatenating the identifier given in the TECS command with the name of the authority. Suppose that the authority was tecs.jgc.org; then a DNS TXT query would go to 1b46ef3d.tecs.jgc.org.

The reply would consist of the ascii-armored public key for that entity and a reputation measure indicating the reliability of that user. The reputation measure would take one of 4 states: unknown (a recently issued key would not have any reputation), good (only a small number of complaints against this ID), medium (some complaints), bad (large number of complaints, probable spam source). The receiving server can verify the signature and use the reputation information to decide on the handling of the connection.

The authority would accept ARF formatted complaints consisting of abusive messages giving connection information, and the full text of the TECS command. They would then investigate to ensure that the reputation database contained up to date and useful information.

How much is a key pair going to cost?
I think it should be cheap for individuals ($25?), fairly cheap for non-profits and charities ($100?), and then a sliding scale for for-profit companies based on size (say $100 for a small company, $1000 for a big one?). The goal would be to make enough money to run the list.

What about mailing lists that forward mail?
By signing their connections they take responsibility for the mails they are sending. So mailing lists would need to have appropriate email policies in place for unsubscriptions, and deal themselves with spam to the list. Since the connection is signed, any concerns about munging of From: addresses for VERP handling, or adding headers/footers to email, are irrelevant.

Is this compatible with SPF, Sender-ID, DomainKeys?
They are orthogonal. There's no direct interaction.

Will this reduce spam?
I'm not going to make any predictions. The goal would be to build a database that makes it easier to recognize someone who is legitimate, and scrutinize those who abuse the system or who choose not to sign.

What about anonymity?
Anonymous remailers are unaffected. They could sign their outbound connections with the system but that would not affect any changes they make to anonymize messages since it's the connection, not the message content, that's signed.

What if I change the mail servers or IP addresses I am using?
There's no effect. Keep signing the connections and you can take responsibility for any IP address you want to.

I think you are wrong, right, stupid, a genius.
Please comment here, or write to me directly.

Labels:

Friday, January 19, 2007

SpamOrHam shut down

Today I shut down SpamOrHam after a total of 357,380 messages were examined by volunteers around the world in 9 months. (If this is the first time you are hearing about SpamOrHam then read this).

At the same time, I'm happy to announce that the associated competition was won by Alan Wylie in the UK. He clicked through 456 messages in a row without once disagreeing with the corpus gold standard classification. His prize is on its way to him tod