Wednesday, September 30, 2009

Spam and Google Wave

After a bunch of Googling around I can find very, very little information on how Google Wave intends to handle the spam problem. A search for 'spam' on the Wave Protocol site yields no results at all. Searching the Google Wave API group for spam yields six unhelpful results. A search of the Wave Protocol group yields a single discussion with eight posts.

In a discussion reported on TechCrunch there was a mention of using whitelisting for spam control in Google Wave:

Q: This seems like this will replace email –but can it really replace all we love about email?

Lars: We think of email as an incredibly successful protocol. Google Wave is our suggestion for how this could work better. You can certainly store your own copy by way of the APIs and with the extensions. The model for ownership — it’s a shared object, so how do you delete the object? Even though it’s a shared object, no one can take it away from you without your consent. There will eventually be reversion to sync up with the cloud or you own servers. We’re not planning on having spam in wave (laughs). Early on in email, spam wasn’t really taken into account, so we benefit from that learning experience. We’re planning on a feature so that you can’t add me to your Wave without being on a white list.

Well, whitelisting doesn't work because people need to receive unsolicited messages from people they don't know. For example, I get lots of messages about my book, or my open source software. I can't whitelist those people before they contact me.

And it's not just me, but businesses need to receive unsolicited mail from potential customers (or even their own customers). Whitelisting simply doesn't work.

Having dealt with spam for a long time, I think they are going to have to come up with a better answer than that. Otherwise a botnet master is going to run a wave server on every bot and start posting spam waves (or worse, waves that appear legitimate and turn into spam waves) to everyone.

I suspect the answer will turn out to be the same mix of technologies used against email spam: message hashing, content analysis, sender reputation, IP blacklisting, ...
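
As a very rough illustration (the helper checks below are hypothetical stand-ins for those services, not anything Google has announced for Wave), such a pipeline might look like this:

# Hypothetical sketch: combining the usual email anti-spam signals to score a wave.
# Each helper is a stand-in for a real service (fuzzy-hash database, Bayesian
# classifier, reputation service, DNSBL); none of this exists in any Wave server.

def hash_is_known_spam(text):
    return False          # stand-in for a fuzzy-hash database lookup

def content_spam_probability(text):
    return 0.0            # stand-in for a Bayesian/content classifier

def sender_reputation_penalty(sender):
    return 0.0            # stand-in for a sender reputation service

def ip_is_blacklisted(ip):
    return False          # stand-in for an IP blacklist (DNSBL) lookup

def looks_like_spam_wave(text, sender, origin_ip, threshold=5.0):
    score = 0.0
    if hash_is_known_spam(text):
        score += 3.0
    score += 4.0 * content_spam_probability(text)
    score += sender_reputation_penalty(sender)
    if ip_is_blacklisted(origin_ip):
        score += 3.0
    return score >= threshold    # the weights and threshold are arbitrary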

Labels:

Friday, September 25, 2009

POPFile v1.1.1

The cool team that manages the POPFile project (that I started what seems like years ago...) have just released v1.1.1 with a bunch of improvements (especially for Windows users).

From the release notes:


1. New features

You can now customize Subject Header modification placement (head or tail)
by changing the new option 'bayes_subject_mod_pos'. (ticket #74)

The NNTP module now caches articles retrieved by message number.

You can now jump to the message header/message body/quick magnets/scores in the
single message view by clicking links at the top of the page. (ticket #77)

You can now filter the messages shown in the history using the 'reclassified' option.
(ticket #67)


2. Windows version improvements

The minimal Perl has been updated to the most recent 5.8 release. Since this
release of Perl only officially supports Windows 2000 or later, POPFile 1.1.1
may not work on Windows 95, Windows 98, Windows Millennium or Windows NT. The
installer will display a warning message explaining that POPFile may not work
properly on these old systems.

The Windows system tray icon's menu now offers options to visit the support
website and check for new versions of POPFile.

If the automatic version check feature has been turned on (via the Security tab
in the User Interface) and a new version is found, the system tray icon will
change and a message box will be displayed. This check is performed once per day.

Now that all known problems with the system tray icon have been fixed it will
be enabled by default in new installations. (ticket #106)

The Windows installer now preselects the relevant components when upgrading or
modifying an existing installation. (tickets #13 and #26)

The Windows installer can now display the UI properly even if the database is
very large (tens of MB). (ticket #109)

Fixed a problem where POPFile did not work on Japanese Windows when the path
of the data directory contains non-ASCII characters (e.g. the user name is
written in Japanese). (ticket #111)

The installer is now compatible with Windows 7.


3. Mac OS X version improvements

An installer for Mac OS X 10.6 (Snow Leopard) is now available.
Since Snow Leopard includes Perl v5.10.0, the Perl modules supplied with the
POPFile installer v1.1.0 or earlier aren't compatible with it.

Starting with this version, two installers will be released: one for Snow
Leopard and one for earlier versions of Mac OS X. The name of the Snow Leopard
installer will have an '-sl' suffix.


4. Other improvements

Users with a very large database (tens of MB) will be able to reclassify
messages faster. (ticket #108)

Labels: ,

Tuesday, July 14, 2009

How to despam Twitter

Here's how I would despam Twitter:

1. A network of honeypot Twitter accounts. I set up the simplest of all honeypot accounts on Twitter and it has 14 followers. With something more sophisticated you'd catch many more.

2. A Report Spam button. Let anyone report spam from the public timeline. Sending to @spam is just too hard.

3. Integrated SURBL/URIBL/anti-phishing look ups. Expand URL shortener links and perform blacklist checks (a rough sketch follows this list). In doing this the system can go back and look at tweets after they are posted (long after if necessary) to remove them. Unlike email, spam can be cleaned up over time.

4. Look for tweets containing multiple terms from the trending topics. These are almost certainly spams.

5. IP address checks. Use SpamHaus to look for messages coming from known bad networks. Keep track of IP addresses associated with Twitter spam.

6. Machine Learning. All of the above, plus the tweet text can be fed to something like POPFile for a decision.

7. Quiet spam removal. Messages that are considered spam should not be deleted. The links they contain should be disabled (no href) until the person responsible for the tweet complains.
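
For point 3, here's a minimal sketch of the idea using only the Python standard library (real code would handle redirect loops, SURBL's documented 127.0.0.x return codes and registered-domain extraction far more carefully):

# Sketch: expand a shortened link and check the target domain against SURBL.
import socket
import urllib.parse
import urllib.request

def expand_url(short_url):
    # Follow redirects and return the final URL the shortener points at.
    with urllib.request.urlopen(short_url) as response:
        return response.geturl()

def domain_is_blacklisted(domain):
    # SURBL answers with a 127.0.0.x address if the domain is listed.
    try:
        socket.gethostbyname(domain + ".multi.surbl.org")
        return True
    except socket.gaierror:
        return False

def tweet_links_are_spammy(urls):
    for url in urls:
        domain = urllib.parse.urlparse(expand_url(url)).hostname
        if domain and domain_is_blacklisted(domain):
            return True
    return False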

Labels:

Tuesday, June 16, 2009

The Billionaire Donation

Yesterday there was a post on Hacker News about how little money people who make donation-ware WordPress plugins actually end up getting.

Almost nine years ago I released my own donation-ware project called POPFile. It's a GNU GPL-licensed email sorting program that uses Naive Bayes to do automatic sorting and spam removal. During 2003 and 2004 it was very popular.

One way of supporting POPFile was to make donations to my PayPal account and over the years people did make donations: 353 in all. The average donation size was $16.39 and I received a total of $5,784.95 (which works out to $74.17 per month). The following chart shows the donations received per month.



If I take SourceForge's numbers as accurate and representative of the total number of POPFile downloads then we have 928,800 downloads which means that 0.038% of downloads resulted in a donation. Or, put differently, a single download was worth $0.006.

One day in 2003 I received a donation from a billionaire. This person, who I'll call simply J. Doe, sent me $25 via PayPal:

You've Got Cash!

Dear John Graham-Cumming,

J. Doe just sent you money with PayPal.
J. Doe is a Verified buyer.

Payment Details

Amount: $25.00
Subject: POPfile donation
Note: Thanks for a great product, keep up the good work!

As I did for every single donation I received I replied with thanks:

Thanks for the donation. Glad to hear that POPFile is working out for you; are you just using it for spam filtering or something more?

And J. Doe replied:

Actually, I have 20 buckets for various topics I receive e-mail related to. One of them is spam, obviously. And I run multiple e-mail accounts through the system.

I'm also doing something potentially interesting, but a major hack: some of my accounts use APOP, so I'm using the hacked version I found in the forums. But the non-APOP accounts then don't work with the same instance of POPfile, using the Mac's mail program -- it always uses APOP if the greeter gives an APOP timestamp, even if you tell it not to. So I run a second instance of POPfile, and symlink the corpus to the first instance.

Kind of strange and bizarre, but it works for now.

I know it isn't trivial, but are you planning on adding support for SSL?

And we bounced back and forth emails for a while. And J. Doe ended up telling me that POPFile had 'saved' an email address that had been public for years and that he wanted to continue to use.

But J. Doe only sent me $25. J. Doe probably could have afforded to send $250 or $2,500. But J. Doe sent $25.

This is entirely because I set the price of POPFile at $0. It's free. Donations are purely altruistic. J. Doe got nothing more from me than anyone else who's emailed me about POPFile over the years. And J. Doe even understood that he'd got a large amount of value from POPFile.

If you choose to do donation-ware you need to realize that almost no one donates. You are making a choice to give away your software and need to treat every donation as what it is: an unexpected gift.

If you want to make a living forget about donations and sell your software. Sell support for your software. Make it your living.

If I really wanted to get J. Doe's money I could have made POPFile closed source, I could have gone and sold the product. I could have made the case for how much saving that email address was worth and I could have charged J. Doe a lot more than $25.

But that's a whole different ball game; that's business.

Labels: ,

Friday, December 12, 2008

POPFile v1.1.0 Released

There's a new POPFile out (v1.1.0) and I had almost nothing to do with it. This is the first release where the new (global) POPFile Core Team did all the work. Thanks Brian in the UK, Joseph in the US, Manni in Germany and Naoki in Japan. A truly global effort.

As part of the v1.1.0 release POPFile has moved from SourceForge to its own server and has a totally new web site.

v1.1.0 also includes some great new features: it is the first to use a
SQLite 3.x database and it is the first to offer a Mac OS X installer in addition
to the usual cross-platform and Windows installer versions.

And there are a raft of bug fixes as well which you can read about in the release notes.

Labels:

Thursday, October 09, 2008

Phishers are using economic problems to catch the unwary

It's hardly surprising, at least to anyone who's spent time looking at phishing scams, but the recent economic turmoil has led phishers to get creative. Here's an example email that preys on the unwary by exploiting the Wachovia/Citibank merger.



Once you visit the site you are asked to download an executable (which actually starts downloading automatically via a page refresh after 15 seconds).



The executable contains a nasty piece of work: Mal-EncPk/BU.



I didn't go further and actually unpack the executable to find out exactly what kind of nastiness it gets up to, but there's plenty it could do.

Labels:

Monday, May 26, 2008

POPFile v1.0.1 released plus a glimpse of the future

POPFile v1.0.1 was released today; this is the first ever POPFile release that I didn't do. POPFile is now being managed by a core team of developers: Manni Heumann (in Germany), Brian Smith (in the UK), me (in France), Joseph Connors (in the US) and Naoki Iimura (in Japan). A truly international effort. The actual release binaries were built by Brian Smith who, for a long time, has been the installer guru.

This release contains minor feature improvements and a number of bug fixes. Some of the bug fixes were for annoying bugs that showed up only occasionally: that makes it a worthwhile upgrade.

Since I pulled back from being involved in every detail of POPFile's evolution the core team has been liberated to work on the project. v1.0.1 is their first release, and it is minor, but much greater things are coming:

1. A native Mac installer

2. A SOHO version of POPFile. Some time ago I did most, but not all, of the work to make a multi-user version of POPFile. That work is being completed by the core team and will allow a single POPFile installation to be shared by multiple users.

Thank you to the POPFile Core Team for this great start to a new chapter in POPFile history.

Labels:

Monday, May 19, 2008

A post (anti-spam-) retirement note

One of the anti-spam companies I was/am involved with, MailChannels, made an interesting announcement recently about a commercial offering for SpamAssassin. What makes the announcement interesting to me is that Justin Mason (who wrote SpamAssassin) is also an advisor to MailChannels.

The program, Traffic Control 3 for SpamAssassin, is a free download and for sites that process less than 10,000 messages per day there's no charge at all (and no need to go and get a license from MailChannels).

Basically, the new product acts as a front-end to SpamAssassin, traffic-shaping incoming messages so that load is taken off SpamAssassin and the mail server.

Labels:

Thursday, May 01, 2008

The Spammers' Compendium finds a new home

Shortly after I announced that I was getting out of anti-spam the folks at Virus Bulletin contacted me about taking over The Spammers' Compendium. I was delighted.

Today the transfer is complete and the new home is here. It will be maintained and updated by Virus Bulletin. Please send submissions to them.

Labels:

Monday, March 31, 2008

Multi-route (email and phone) self-aware phishing

Today, I received the following email:

This communication was sent to safeguard your account against any
unauthorized activity.

Max Federal Credit Union is aware of new phishing e-mails
that are circulating. These e-mails request consumers to click
a link due to a compromise of a credit card account.

You should not respond to this message.

For your security we have deactivate your card.

How to activate your card

Call +1 (800)-xxx-9629

Our automated system allows you to quickly activate your card

Card activation will take approximately one minute to complete.


Of course, I don't have an account with Max Federal Credit Union and this is obviously a phish. Notice that the English isn't quite right:

"For your security we have deactivate your card." and "You should not respond to this message." doesn't make sense in context.

What's more interesting is that the message itself warns you about phishing emails and asks you to call an 800 number.

If you call the 800 number an electronic voice reminds you again to never give your PIN, password or SSN in email and then proceeds to ask you for the card number, PIN, expiry date and CVV2. The assumption is that you've been warned twice not to do something in email, so it's OK by phone.

It's painful to see the phisher use the existence of phishing as a way to phish.

Labels:

Tuesday, March 25, 2008

"Retiring" from anti-spam

Today, I'm "retiring" from anti-spam work. Practically, that means the following:
  • No more updates to The Spammers' Compendium or Anti-spam Tool League Table pages. These remain on line, but are not being maintained.
  • I'm looking for a new leader for the POPFile project.
  • I'm no longer active on any anti-spam mailing lists.
  • I am leaving all anti-spam conference committees.
  • My anti-spam newsletter is no longer being published.
I will, however, be continuing with commercial anti-spam work where I have agreements currently in place with customers. No change to their support, terms or assistance.

The obvious question is why? For me, the interest just isn't there. The battle against spam continues but is now about trench warfare rather than creating new weapons. We'll continue to see innovation, but for any hacker it's the new, new thing that's important. For me, spam is yesterday's news. Watching companies squabble and refuse to cooperate, seeing a decline in quality at anti-spam conferences, and watching major companies essentially kill their consumer anti-spam products means anti-spam just isn't where I want to be.

Of course, there are many really good people fighting spam out there. This post isn't meant to demean them.

Thank you to everyone who has supported what I've done over the last 7 years, and good luck!

Labels:

Tuesday, March 11, 2008

First assume all new email is useless

When I download email none of it goes in my Inbox. In fact, I don't have an Inbox. I work on the assumption that all new email is useless.

Many reports tell us that between 80% and 90% of all email is spam, so for starters only 10% to 20% is at all likely to be useful. Then, if you account for being on mailing lists, being CC:ed needlessly and receiving automatic updates such as order confirmations from Amazon.com, you'll see that almost all email is useless. Only a tiny fraction of the mail you receive is useful. And by useful I mean requiring action.

I use Thunderbird and my email folder structure looks like this:



When email arrives it is automatically sorted using POPFile into the folders: Family, GNU Make, Misc, polymail, POPFile and Spam. These six folders are the categories of mail that I receive:

  • POPFile: Since I wrote POPFile I get lots of mail about it and I use this as a general box for other open source projects I work on and anything else about anti-spam
  • polymail: Anything to do with my commercial product polymail and my consulting business
  • GNU Make: Anything to do with GNU Make or the company, Electric Cloud, that I co-founded
  • Family: Anything from my family
  • Misc: Order confirmations, airline tickets, PayPal statements, etc.
  • Spam: spam

POPFile uses Naive Bayesian text classification to automatically sort my email (with just a point and click interface for training) and then six rules (which never need updating) move the incoming mail based on POPFile's classification to one of those folders.
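
For anyone curious what the Naive Bayesian step amounts to, here's a minimal sketch of training and scoring (not POPFile's actual code, which is Perl and considerably more complete; the buckets and words are just examples):

# Minimal Naive Bayes sketch (add-one smoothing, class priors omitted).
import math
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # counts[bucket][word]

def train(bucket, words):
    for word in words:
        counts[bucket][word] += 1

def classify(words):
    best_bucket, best_score = None, float("-inf")
    for bucket, word_counts in counts.items():
        total = sum(word_counts.values())
        vocab = len(word_counts)
        # Sum of log word probabilities with add-one smoothing.
        score = sum(math.log((word_counts.get(word, 0) + 1) / (total + vocab))
                    for word in words)
        if score > best_score:
            best_bucket, best_score = bucket, score
    return best_bucket

train("spam", "viagra cheap pills".split())
train("popfile", "bayes corpus bucket".split())
print(classify("cheap viagra".split()))   # prints: spam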

Of course, POPFile can be used to sort mail in any way you choose: my categories are unlikely to be yours. You might use POPFile to sort Work from Home from Spam. At least one journalist I know uses POPFile to sort Interesting from Boring from Spam so that he only gets to read interesting press releases.

When I identify mail that does need action taken I move it to the ACTION folder (which is the closest I've got to an Inbox). Moving mail there is a snap because I use the QuickMove extension for Thunderbird and have ALT-number keys mapped to each folder: one key press and the message is moved into or out of ACTION.

To keep on top of things I publish the number of items in my ACTION folder on my web site. Here's a live view over the last 24 hours. Currently, 9 items need dealing with.



My rules for managing email:

  1. Assume that all new email is useless
  2. Automatically sort email into folders on delivery
  3. Take control of your Inbox: only you put email in it

Labels: ,

Monday, March 03, 2008

To the idiotic spammer posting comment spam on this site

Since your name is two Chinese characters I'm going to address you as "Dude".

Dude,

Lately you've been posting comment spam on my blog for your World of Warcraft Gold. This is a little silly:

1. I'm fairly well known in anti-spam circles, did you really think I was going to let comment spam through on this site?

2. Comment moderation is turned on on this site. So your comment spam goes nowhere when I click the Discard button.

3. There has been some collateral damage from your World of Warcraft spamming. I accidentally killed two comments by Hypermechanic and I can't retrieve them. He/she wanted to say something useful about an old post:


You could do that like cameroid.com .
I guess in JAVA or .NET.

and

Cool I will hunt for it… This is a very sweet look app you have here. Even though you down play your role this is still brilliant.

Thank you for something new and useful.

Labels:

Wednesday, February 06, 2008

A clever, targeted email/web scam with a nasty sting

Steve Kirsch sent me an interesting message he'd received from [email protected] (i.e. an email address from the Better Business Bureau) containing an apparent complaint from a customer submitted through the BBB. The email itself was actually sent from a BellSouth ADSL line (i.e. almost certainly a zombie machine). The address was not authorized to send as bbb.org according to BBB's SPF records.

But the content of the email message is very interesting. Here's a screenshot:



Notice how the email contains the correct address for Steve, his name and the name of his company and thus appears to be a real complaint. The link below the complaint, where you can get full details, is the first of two nasty stings in this message.

The actual URL is:

http://search.bbb.org/ViewReport.aspx?sess=32b7aa40a693b0296624c132240947d0d
39365d555819ead6b5e59cc7f257bdeb1fc30198daafdfe88937338af1519acbeafcc25e5a3153db
0320ba5d0c0579a775c83632dd36d7971b95d9f85e64bdb&lnk=http%3a%2f%2faltaconsultants.com
%2fcomplaints%2fViewReport.php?case=840915898&biz=&bbb=1186

i.e. the link actually goes to the BBB's own web site (making it seem even more likely that this is a genuine message). The link manipulates the search option on the BBB web site using the lnk parameter to perform a redirect to http://altaconsultants.com/complaints/ViewReport.php?case=840915898 which in turn redirects to http://www.kfsolicitors.com/complaints/ViewReport.php?case=840915898. And it's on that, presumably hacked, site that the real scam starts.

If you are not using Microsoft Internet Explorer you'll be presented with the following web page:



Once you've upgraded you get told that the web site requires the "Adobe Acrobat ActiveX" control and you need to install it.



The control itself is embedded using the following code:

<object classid="clsid:D68E2896-9FD9-4b70-A9AE-CCDF0C321C45" height="0" width="0" codebase="Acrobat.cab"></object>

Notice how instead of pointing to Adobe's web site to get the control it's available locally as Acrobat.cab. So when you follow the instructions you download and install an ActiveX control from the scammer web site.

Once you've done that you get told that in fact the customer has withdrawn their complaint and there's nothing to worry about:



Now for the second sting. There must be something about this ActiveX control that's malicious... the scammer didn't go to all that trouble for nothing. But none of the current anti-virus programs report any problems with the file.

For example, my Sophos anti-virus says nothing, and online scanners such as Kaspersky's say that it's clean:



So, perhaps the file really is clean, but I suspect that this is a new threat which isn't currently detected by anti-virus. I'll post again when I get a response from Sophos' anti-virus brainiacs. Perhaps I'm wrong, but be very wary of these mails.

Further information about BBB-related scams is available on their web site.

UPDATE: McAfee WebImmune tells me that this is a new detection of the spy-agent.cf SpyWare which steals information about your web surfing.

UPDATE: A scan using VirusTotal shows that very few anti-virus programs are detecting this (although their version of Kaspersky is finding it---curious that the online Kaspersky scanner does not).

Labels: ,

Thursday, October 18, 2007

Times Square: a fun spammer GIF

Nick FitzGerald reported a neat spammer image trick to me the other day. It's entered in The Spammers' Compendium and involves using animation to display the word Viagra, emulating a flashing neon sign.

Since many OCR systems merge the layers together before OCR, this image is deliberately built in the 'wrong' order: once merged, the letters read VIRAAG.
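
Here's a rough sketch of that merging step, assuming the Pillow imaging library (the filenames are placeholders, and the spam image itself obviously isn't reproduced here):

# Sketch: flatten an animated GIF the way a naive OCR pre-pass might.
from PIL import Image, ImageSequence

def flatten_gif(path):
    gif = Image.open(path)
    flat = Image.new("RGBA", gif.size, (255, 255, 255, 255))
    for frame in ImageSequence.Iterator(gif):
        rgba = frame.convert("RGBA")
        # Frames are simply stacked, so it's the stacking order, not the
        # animation timing, that decides which letters end up where.
        flat.paste(rgba, (0, 0), rgba)
    return flat.convert("RGB")   # ready to hand to an OCR engine

flatten_gif("neon_sign_spam.gif").save("flattened.png")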

Labels:

Thursday, October 11, 2007

More spammer crossword creativity

Nick FitzGerald writes in with a variant of the "1 across, 3 down" spammer content trick which looks like this:



The neat thing is that the crossword is created using HTML in a way that prevents a simple HTML-stripping spam filter from reading the brand names. To a simple spam filter this looks like:

CA
BREIT
OM
R
O
L
E
TIER
ING
G
X

The actual HTML (cleaned up by Nick) is:

<TABLE>
<TR>
<TD>
<DIV align=right>
CA<BR>
<BR>
BREIT<BR>
OM
</DIV>
</TD>
<TD>
<DIV align=center>
R<BR>
O<BR>
L<BR>
E
</DIV>
</TD>
<TD>
TIER<BR>
<BR>
ING<BR>
GA
</TD>
</TR>
<TR>
<TD>
<DIV align=center>
X
</DIV>
</TD>
</TR>
</TABLE>
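
To see why a simple filter gets the scrambled version, here's a minimal sketch of the kind of tag-stripping such a filter does; applied to markup like the table above it yields the column fragments rather than the readable brand names:

# Sketch of the 'simple HTML-stripping' step: drop the tags, keep the text.
import re

def strip_tags(html):
    text = re.sub(r"<[^>]+>", "\n", html)    # replace every tag with a newline
    return [line.strip() for line in text.splitlines() if line.strip()]

print(strip_tags("<TD><DIV align=right>CA<BR><BR>BREIT<BR>OM</DIV></TD>"))
# ['CA', 'BREIT', 'OM']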

Labels:

Friday, August 31, 2007

Useful sources of messages for testing spam filters

- Enron Corpus

- PU corpus

- SpamAssassin public corpus

- TREC 2005 Public Spam Corpus

- TREC 2006 Spam Track Public Corpora

- 20 Newsgroups

Labels:

Saturday, July 14, 2007

Last push for POPFile voting

POPFile has been nominated for a SourceForge Community Choice Award due to the efforts of many people.

Voting closes on July 20 and there's strong competition in POPFile's category from the likes of Pidgin (formerly GAIM) and phpBB.

If POPFile wins SourceForge will be making a donation to a charity that I picked: Doctors without Borders.

If you feel like voting for POPFile, please vote in the Best Project for Communications category here:

http://sourceforge.net/awards/cca/vote.php

Labels:

Tuesday, July 03, 2007

Please vote for POPFile

POPFile has been nominated for a SourceForge Community Choice Award through the efforts of its users.

Now it's time to vote.

If you love POPFile, please vote for it in the Best Project for Communications category.

Labels: ,

Tuesday, June 26, 2007

Pretty Darn Fancy: Even More Fancy Spam

Looks like the PDF wave of spam is proceeding on a totally different tack. Sorin Mustaca sent me a PDF file that was attached to a pump and dump spam. Unlike the previous incarnation of PDF spams, this one looks a lot like a 'classic' image-spam.



It's some text (which has been misaligned to fool OCR systems), but there's a little twist. Zooming in on a portion of the spam shows that the letters are actually made up of many different colors (a look inside the PDF reveals it's actually an image).



I assume that the colors and the font misalignment are all there to make it difficult to OCR the text (and wrapping it in a PDF will slow down some filters).

Extracting the image and passing it through gocr gave the following result:

_ Ti_ I_ F_ _ Clih! %
ogx.

%_ % _c. (sR%)
t0.42 N %x

g0 _j_ __ h_ climb mis _k
d% % %g g nle_ Frj%.
_irgs__Dw_s _ rei_ � a
f%%d __. n_ia une jg gtill
oo_i_. Ib _ _ _ _ _ _ 90I
Tgdyl

And a run through tesseract resulted in:

9{Eg Takes I:-west.tY`S Far Gawd Climb! UP
fatued Sta=:khlauih. This we 1: still

Close, but no cigar.

Labels:

Thursday, June 21, 2007

Pretty Darn Fancy: Stock spammers using PDF files

Joe Chongq sent me a fascinating spam that is the first I've seen that's using a PDF file to send its information. I've long predicted that we'll see a wave of complex formats used for spam as broadband penetration increases and sending large spams becomes possible.

This particular spam has a couple of interesting attributes:

1. The PDF file itself is a really nicely formatted report about a particular stock that's being pumped'n'dumped.

2. The file name of the PDF was chosen to entice the user further by using their first name. In this case it was called joe_report.pdf.

3. The PDF is compressed using the standard PDF 'Flate' algorithm and totals 84,398 bytes. That's fairly large, but we've certainly seen image spams that were larger. Use of compression here means that a spam filter that's not aware of PDF formats would be unable to read the message content (a rough sketch of looking inside follows).
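
A filter that wants to look inside such a PDF doesn't need a full PDF engine for a first pass. Here's a rough sketch (stream boundaries are found naively with a regular expression; a real implementation would parse the PDF object structure properly) that inflates the Flate-compressed streams so their contents can be examined:

# Sketch: inflate the Flate-compressed streams inside a PDF.
import re
import zlib

def inflate_pdf_streams(path):
    data = open(path, "rb").read()
    chunks = []
    for match in re.finditer(rb"stream\r?\n(.*?)endstream", data, re.DOTALL):
        try:
            chunks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            pass    # not a Flate stream (an embedded image, for example)
    return chunks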

Here's what the actual PDF looks like (click for a larger view):


Labels:

Wednesday, June 20, 2007

jeaig (jgc's email address image generator) launches

Today is launch day for a simple little web site called jgc's email address image generator. It's a web service that enables anyone to generate a CAPTCHA like image containing their email address for insertion on their web site.

Since web crawlers are currently unable to look inside images to scrape email addresses, this means that your email address will not be scraped from a web site, but can still be written down by a human.

Here's an example email address:


Made using jeaig

and the code that generates that is:

<img src="http://jeaig.org/image/UmFuZG9tSVacujftR-zi3H6s63Vd-
6iIaBfIo-zwZhTmeKRTo4CL818NTBYEAUT4sFRAFiQhbkzU*tHehFDLYniqtHVc3CTr.png">
<br />
<font size="-2">Made using <a href="http://jeaig.org/">jeaig</a></font>

The server does not store the email address: when a user enters an email address on the site it is padded with a random amount of random data (both before and after the address), with the randomness supplied by /dev/urandom. Then the address is encrypted using Blowfish with a secret key known only to the jeaig server (this key was also generated from /dev/urandom) and a random IV is chosen.

The encrypted data is then encoded with a modified base-64 scheme so that it can be used in the URL.

When the image is requested using the special base-64 filename the email address is decrypted and then rendered using CAPTCHA code to produce the image. The image changes each time the email address is loaded.
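
Here's a minimal sketch of that encode/decode round trip, assuming the PyCryptodome package for Blowfish. This is not jeaig's actual (Perl) code; in particular, the length bytes used to strip the padding again are my own convention for the sketch, and the email address is a placeholder.

# Sketch of the jeaig-style encode/decode steps described above.
import os
import base64
from Crypto.Cipher import Blowfish

SECRET_KEY = os.urandom(16)    # the real service uses one fixed server-side key

def encode_address(email):
    addr = email.encode()
    front = os.urandom(1 + os.urandom(1)[0] % 15)       # random front padding
    body = bytes([len(front), len(addr)]) + front + addr
    body += os.urandom((-len(body)) % 8 + 8)            # random tail padding,
                                                        # rounded to Blowfish's
                                                        # 8-byte block size
    iv = os.urandom(8)                                  # random IV per address
    cipher = Blowfish.new(SECRET_KEY, Blowfish.MODE_CBC, iv)
    return base64.urlsafe_b64encode(iv + cipher.encrypt(body)).decode()

def decode_token(token):
    raw = base64.urlsafe_b64decode(token.encode())
    iv, data = raw[:8], raw[8:]
    body = Blowfish.new(SECRET_KEY, Blowfish.MODE_CBC, iv).decrypt(data)
    front_len, addr_len = body[0], body[1]
    return body[2 + front_len:2 + front_len + addr_len].decode()

print(decode_token(encode_address("someone@example.org")))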

And, yes, this is a free service.

Labels: ,

Monday, June 18, 2007

POPFile v0.22.5 Released

Here are the details:

Welcome to POPFile v0.22.5

This version is a bug fix and minor feature release that's been over a
year in the making (mostly due to me being preoccupied by other
things).


NOMINATING POPFILE FOR AN AWARD

SourceForge has announced their 'Community Choice Awards' for 2007 and
is looking for nominations. If you feel that POPFile deserves such an
honour please visit the following link and nominate POPFile in the
'Best Project for Communications' category. POPFile requires multiple
nominations (i.e. as many people as possible) to get into the list of
finalists.

http://sourceforge.net/awards/cca/nomination.php?group_id=63137

Thanks!


WHAT'S CHANGED SINCE v0.22.4

1. POPFile now defaults to using SQLite2 (the Windows installer will
convert existing installations to use SQLite2).

2. Various improvements to the handling of Japanese messages and
improvements for the 'Nihongo' environment:

Performance enhancement when converting character encodings by
skipping conversions that were not needed. Fix a bug where the
wrong character set was sometimes used when the charset was not
defined in the mail header. Fix a bug where several HTML
entities caused misclassification. Avoid an 'uninitialized value'
warning. Fix a bug where the word links in the bucket's detail
page were not url-encoded.

3. Single Message View now has a link to 'download' the message from
the message history folder.

4. Log file now indicates when an SSL connection is being made to the
mail server.

5. A number of small bug fixes to the POPFile IMAP interface.

6. Installer improvements:

Email reconfiguration no longer assumes Outlook Express is
present. Add/Remove Programs entries for POPFile now show a more
realistic estimate of the program and data size. Better support
for proxies when downloading the SSL Support files. The SSL
patches are no longer 'hard-coded', they are downloaded at
install time. This will make it easier to respond to future
changes to the SSL Support files. The Message Capture Utility
now has a Start Menu shortcut to make it easier to use the
utility. The minimal Perl has been updated to the latest
version. The installer package has been updated to make it work
better on Windows Vista (but further improvements are still
required).


WHERE TO DOWNLOAD

http://getpopfile.org/wiki/download


GETTING STARTED WITH POPFILE

An introduction to installing and using POPFile can be found in the
QuickStart guide:

http://getpopfile.org/wiki/QuickStart


SSL SUPPORT IN WINDOWS

SSL Support is offered as one of the optional components by the
installer. If the SSL Support option is selected the installer will
download the necessary files during installation.

If SSL support is not selected when installing (or upgrading) POPFile
or if the installer was unable to download all of the SSL files then
the command

setup.exe /SSL

can be used to run the installer again in a special mode which will
only add SSL support to an existing installation.


CROSS PLATFORM VERSION KNOWN ISSUES

The current version of SQLite (v3.x) is not compatible with POPFile.
You must use DBD:SQLite2 to access the database.

Users of SSL on non-Windows platforms should NOT use IO::Socket::SSL
v0.97 or v0.99. They are known to be incompatible with POPFile; v1.07
is the most recent release of IO::Socket::SSL that works correctly.


v0.22.0 RELEASE NOTES

If you are upgrading from pre-v0.22.0 please read the v0.22.0 release
notes for much more information:

http://getpopfile.org/wiki/ReleaseNotes


DONATIONS

Thank you to everyone who has clicked the Donate! button and donated
their hard earned cash to me in support of POPFile. Thank you also to
the people who have contributed their time through patches, feature
requests, bug reports, user support and translations.

http://sourceforge.net/forum/forum.php?forum_id=213876


THANKS

Big thanks to all who've contributed to POPFile over the last year.

John.

Labels: ,

Tuesday, June 05, 2007

A little light relief in an 'enlargement' spam

Jason Steer from IronPort sent me over an image taken from a recent spam for an 'enlargement' product. Here's the image:



Since the text is large I thought it would be fun to run this through gocr to OCR out the text and URL. Here's the output of gocr on the image (I removed a few blank lines for clarity here):

(PICTURE)
__%[0______
__ ___


______
_ ____

So, that worked out well :-) Nevertheless, the domain is listed in the SURBL:

$ dig relies.net.multi.surbl.org
;; QUESTION SECTION:
;relies.net.multi.surbl.org. IN A

;; ANSWER SECTION:
relies.net.multi.surbl.org. 2100 IN A 127.0.0.4

So, if you can extract the domain name from the image it's possible to check it against the SURBL and blacklist the message. Switching over to Google's Tesseract OCR system revealed the following:

(I LIT inl 3
IE\)' :
Q ii ,g @1

Lgiiiizj
i
H ; ik
s$i` wg `i?! J

Labels:

Friday, May 25, 2007

Back from the EU Spam Symposium; here's my talk

So I'm back home from the 2007 EU Spam Symposium, which was held in Vienna, Austria, and you can grab my presentation here. You'll notice that the presentation template is from MailChannels. They very kindly sponsored my trip to Vienna and so I did a little publicity for them. There's only one slide, however, that's actually anything to do with MailChannels in the entire presentation, so don't expect a product pitch!

One thing I didn't mention in my talk was that as the number of Internet hosts expands and the number of broadband subscribers grows the number of competing botnets can also grow. That means I'd expect to see the price of botnet rental dropping as the Internet grows leading to lower costs for spammers.

I'll give a complete round up of the conference in my newsletter next week, but overall there were some interesting talks, and meeting some people like Richard Cox from SpamHaus and Richard Clayton was very useful.

Labels:

Tuesday, May 15, 2007

Some architectural details of Signal Spam

Finally, Signal Spam, France's new national anti-spam system, launched and I'm able to talk about it. For a brief introduction in English start here.

I'm not responsible for the idea behind Signal Spam, nor for its organization, but I did write almost all the code used to run the site and the back end system. This blog post talks a little bit about the design of Signal Spam.

Signal Spam lets people send spams via either a web form, or a plug-in. Plug-ins are currently available for Outlook 2003, Outlook 2007 and Thunderbird 2.0; more coming. Currently Signal Spam does three things with every message: it keeps a copy in a database after having extracted information from the body and headers of the message; it figures out if the message came from an ISP in France and if so sends an automatic message to the ISP indicating that they've got a spammer or zombie in their network; it figures out if the message was actually a legitimate e-marketing message from a French mailer and informs the person reporting the spam of how to unsubscribe.

The original plan was that the system be capable of handling 1,000,000 messages per day allowing for peaks of up to 1000 messages per minute (such as when people first come to work in the morning) and that messages would be handled in near real-time (i.e. the time from a message being received by the system to it being analyzed and forwarded to an ISP would be under 60 seconds). Signal Spam also wanted a lot of flexibility in being able to scale the system up as use of the site grew and being able to do maintenance of the site without taking it down.

Here's the last 12 hours of activity on the site, which pretty much matches what we expected with a peak once people get to work and start reading their mail. (These charts are produced automatically by the lovely RRDTool.)



The system I built is split into two parts: the front end (everything that the general public sees including the API used by the plug-ins) and the back end (the actual database storing the messages sent, the software that does analysis and the administration interface). Communication between the front end and the back end uses a web service running over HTTPS.

To make things scale easily the back end is entirely organized around a simple queuing system. When a message arrives from a user it is immediately stored in the database (there are, in fact, two tables: one table contains the actual complete message as a BLOB and the other contains the fields extracted from the message. The messages have the same ID in each table and the non-BLOB table is used for searching and sorting).

Once stored in the database the message ID is added to a FIFO queue (which is actually implemented as a database table). An arbitrary number of processes handle message analysis by dequeuing IDs from the FIFO queue (using row-level locking so that only one process gets each ID). Once dequeued the message is analyzed: fields such as From, Subject, Date are extracted and stored in the database, the Received headers are walked using a combination of blacklist lookup and forgery detection to find the true IP address that injected the message into the Internet, the IP address is mapped to the network that manages it, fingerprints of the message are taken and all URLs inside the message are extracted.
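
As a rough illustration of the dequeue step (the table and column names here are illustrative, and the real code is Perl, not Python), a worker claiming an ID under row-level locking might look like this:

# Sketch, assuming a DB-API connection to a MySQL/InnoDB database
# with a queue table analysis_queue(id).

def dequeue_one(conn):
    cur = conn.cursor()
    try:
        # Row-level lock so that only one worker process claims this ID.
        cur.execute("SELECT id FROM analysis_queue ORDER BY id LIMIT 1 FOR UPDATE")
        row = cur.fetchone()
        if row is None:
            conn.rollback()
            return None
        cur.execute("DELETE FROM analysis_queue WHERE id = %s", (row[0],))
        conn.commit()
        return row[0]           # the message ID to load and analyze
    except Exception:
        conn.rollback()
        raise

A worker simply loops on dequeue_one, analyzes the message it gets back and enqueues the ID on the forwarding or response queue; adding capacity means starting more workers.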

Once the analysis is complete the process decides whether the message needs to be sent to an ISP. If so it enqueues the message ID on another FIFO queue for a separate forwarding process to handle. If the message is in fact a legitimate message then the message ID is enqueued on a FIFO queue for another response process to handle.

The forwarding process generates an ARF message to the appropriate ISP and sends the message for every ID that it dequeues from the queue, using VERP for bounce and response handling.

The response process dequeues IDs and responds to the original reporter of the spam with a message tailored for the specific e-marketer with full unsubscribe details.

The use of queues and a shared database to handle the queues, plus a simple locking strategy, means that arbitrary numbers of processes can be added to handle the load on the system as required (currently there is only one process of each type running and it handles all messages within the required delay). It also means that the processes do not need to be on the same machine and the system can scale by adding processes or adding hardware.

Stopping the processes does not stop the front end from operating. Messages will still be added to the database and the analysis queue will grow. In fact, the length of the queue makes measuring the health of the system trivial: just look at the length of the queue to see if we are keeping up or not.

Since the queue holds the knowledge about the work to be done, processes can be stopped and upgraded as needed without taking the system off line.

To hide all this complexity the entire system (which is written in Perl---in fact, the back end is entirely LAMP) uses an object structure. For example, creating the Message object (passing the raw message into the constructor) performs the initial message analysis and queues the message for further analysis. Access to the database queues is entirely wrapped in a Queue object (whose constructor takes the queue name). These objects are dynamically loaded by Perl and can be upgraded as needed.

Finally, all the objects (and related scripts) have unit tests using Perl's Test::Class and the entire system can be tested with a quick 'make test'. One complexity is that most of the classes require access to the database. To work around this I have created a Test::Database class that is capable of setting up a complete MySQL system from scratch (assuming MySQL is currently installed) and loading the right schema, that is totally independent of any other MySQL instance. The class then returns a handle (DBI) to that instance plus a connect string. This means the unit tests are totally independent of having a running database.

In addition, the unit tests include a system that I created for POPFile which allows me to get line-by-line coverage information showing what's tested and what's not. By running 'make coverage' it's possible to run the test suite with coverage information. This gives percentage of lines tested and for every Perl module, class and script a corresponding HTML file is generated with lines colored green (if they were executed during testing) or red (if not). The coloring is achieved by hooking the Perl debugger (see this module from POPFile for details).

Here's an example of what that looks like (here I'm showing Classifier/Bayes.html which corresponds to Classifier/Bayes.pm in the POPFile code, but it looks the same in Signal Spam):



The green lines (with line numbers added) were executed by the test suite; the red line was not (you can see here that my test suite for POPFile didn't test the possibility that the database connect would fail).

Labels: ,

Perhaps OCRing image spams really is working?

I've previously been skeptical of the idea that OCRing image spams was a worthwhile effort because of the variety of image-obfuscation techniques that spammers had taken to using.

But Nick FitzGerald has recently sent me an example of an image spam that seems to indicate that spammers are concerned about the effectiveness of OCR. Here's the image:



What's striking is that the spammer has used the same content-obscuring tricks that we've seen with text (e.g. Viagra has become [email protected]@), perhaps out of fear that the OCRing of images is working and revealing the text within the images.

Or perhaps this spammer is just really paranoid.

Labels:

Tuesday, April 17, 2007

No newsletter for April 15, 2007

This blog post is really for people who read my spam and anti-spam newsletter. There won't be one this week (it was due on April 15) because I wanted to spend time in the newsletter reviewing the MIT Spam Conference 2007.

Unfortunately, the videos of the presentations on the web site are either of very poor quality or have no sound at all. This makes reviewing the presentations very hard. Bill Yerazunis has promised better videos 'coming soon', and I'm waiting for them.

While I'm complaining I also find the 'download an ISO' approach to getting the papers ridiculous. I've asked Bill to provide individual links to each of the papers and presentations so that people can just click and download what they want. He says...

We *could* have individual links. I assume the
individual authors would do so in any case.

And I'll probably do that later.

But right now, I'm sorta trying to get people to
at least glance at all the papers, and put the CDROMs into
their local libraries; that's the motivation.

So, my next newsletter will be out on April 30 with a review of the MIT Spam Conference and other regular news.

Labels:

Thursday, March 15, 2007

Calibrating a machine learning-based spam filter

I've been reading up about calibration of text classifiers, and I recommend a few papers to get you started:

The overall idea is that the scores output by a classifier need to be calibrated so that they can be understood. And, specifically, if you want to understand them as a probability that a document falls into a particular category then calibration gives you a way to estimate the probability from the scores.

The simplest technique is bucketing or binning. Suppose that the classifier outputs a score s(x) in the range [0,1) for an input document x. Once a classifier is trained it's possible to calibrate it by classifying known documents, recording the output scores and then, for each of a set of ranges (e.g. divide [0,1) into 10 ranges [0.0,0.1), [0.1,0.2) and so on), counting the number of documents in each class to get a fraction that represents the probability within that range. Then when you classify an unknown document you look at the range in which its score falls to get probability estimates.

For spam filtering each range would consist of the number of emails that are ham in the range, and the number that are spam. The ham probability then just comes out to # of ham emails in the range / total number of emails in the range.
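
Here's a minimal sketch of that binning scheme (ten equal-width bins over [0.0,1.0), as described above):

# Sketch: bin classifier scores on a labelled corpus, then read off P(ham | bin).

def bin_index(score):
    return min(int(score * 10 + 1e-9), 9)    # guard against float edge cases

def calibrate(scored):
    # scored is a list of (score, label) pairs with label "ham" or "spam"
    bins = [{"ham": 0, "spam": 0} for _ in range(10)]
    for score, label in scored:
        bins[bin_index(score)][label] += 1
    return bins

def ham_probability(bins, score):
    b = bins[bin_index(score)]
    total = b["ham"] + b["spam"]
    return b["ham"] / total if total else None    # None: too little data to say

# e.g. bins = calibrate(scored_corpus); ham_probability(bins, 0.97)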

I decided to run a little test of this and took the complete TREC 2005 Public Spam Corpus and trained a Gary Robinson style spam filter on it. Then I classified the entire corpus to get calibration data. Instead of looking at the final scores, I looked at the pair of scores (one for ham, one for spam) that the classifier generates and used those to generate bins. The following chart shows (crudely) the percentage of spams and hams in each bin:



The x-axis is the ham score in the range [0.0,0.9) with each square representing a single 0.1-width bin. The left-most square means that the classifier had a very low score in terms of hamminess, right-most square means that the classifier had a very high score.

The y-axis is the spam score in the [0.0,0.9), so that the bottom means a low spamminess score and the top a high one. Each square is colored red and green. Red is the percentage of messages in that bin that are spam, and green the percentage of messages in that bin that are ham.

So as you'd expect the top left corner (low ham score, high spam score) is generally colored red indicating that these are spam messages, and the bottom right corner (low spam score, high ham score) is colored green (lots of ham). Some squares are black because there's too little data (I arbitrarily said that if there were less than 5 messages in the bin it was colored black).

But there's a really interesting square in the top right: it's almost all red (in fact, it's 93% red or spam). So we can have confidence that a message there (which corresponds to a final 'score' of 0.5) is in fact spam.

I'm sure that there are other interesting insights to be gained from calibration and I'd like to spend the time to evaluate the effectiveness of using the probabilities generated in this way (instead of the rather arbitrary selection of a score 'cutoff' point) as the way to determine whether a message is spam or ham (or undetermined). For example, the square at (0.9,0.2) (which corresponds to a 'score' of 0.18) is 57% ham, 43% spam so looks like a good candidate for undetermined; it looks like a much better candidate than the typical "area around 0.5" (which corresponds to (0.9,0.9) and is 93% spam).

Labels: ,

An image-based spam that brags about its own delivery

Nick FitzGerald sent me a simple image-based spam for Viagra which starts with a brag on the part of the spammer:



Nick calls it a 'self aware' spam.

Labels:

Wednesday, February 28, 2007

Image spammers doing the twist

It's been quite a while since I last blogged about ever changing image spam. Anna Vlasova wakens me from my unblogging slumber with some great samples of recent image spams where the spammer has decided to rotate the entire image to try to avoid detection. Take a look at this first one:

The spammer has really gone to town here:
  • There's random speckling all over the images to upset hashing and OCR techniques
  • There's no URL in the message itself (it's in the image)
  • The entire image has been rotated to the left to obscure the text

And, of course, they are not going to be content with just one rotation and can randomize the angle per message:

And they've gone even further by slicing the image up, randomizing the angle and overlaying the elements using animation.

Labels:

Wednesday, February 07, 2007

Trusted Email Connection Signing (rev 0.2)

IMPORTANT: This blog post deprecates my previous posting on this subject. The blog post Proposal for connection signing reputation system for email is deprecated.


Sign the medium, not the message



The motivation behind TECS (Trusted Email Connection Signing) is that what managers of MX servers on the public Internet really care about is the ability to distinguish a good connection (coming from a legitimate sender and which will be used to send wanted email) from a bad connection (coming from a spammer). If you can identify a bad connection (today, you do that using an RBL or other reputation service based on the IP address of the sender) you can tarpit or drop it, or subject the mails sent on the connection to extra scrutiny. If you can identify a good connection it can bypass spam checks and help reduce the overall false positive rate.

If you are a legitimate bulk mailer (an email marketer, for example) then you care deeply that your reputation be recognizable and that mail sent from you be delivered. Currently, you have to carefully tend your IP addresses to make sure that they don't appear on blacklists, and you have to ensure that new IP addresses are clean.

If you are running a large email service (e.g. Yahoo! Mail) then you are currently trying to build white and blacklists of IP addresses, when what you really want are white and black lists of entities.

Currently, the options used to identify a bad connection are rather limited (RBLs, paid reputation services and grey listing), and good connections are hard to manage (whitelists on a per-recipient basis, or pay-per-mail services). What's needed is a different approach.

The idea is to identify and determine the reputation of the entity connecting to a mail server in real-time without resorting to a blacklist or whitelist. This is done by signing the connection itself. With the signature on a per-connection basis a mail server is able to determine who is responsible for the connection, and then look up that entity's reputation in a database.

Current reputation databases are based on IP addresses. This is a very inflexible system: IP addresses must be added to blacklists very fast as spammers churn through zombie machines, and any legitimate emailer needs to make sure their mail servers are whitelisted with multiple email providers (e.g. Yahoo!, Gmail, Brightmail, ...) to ensure delivery. And if a legitimate mailer wants to bring on line new servers, with new IP addresses, they have to run through the entire whitelisting process again.

This is inefficient. The mapping between IP addresses and entities (e.g. knowing that Google's Gmail service uses a specific set of IP addresses) is unwieldy to manage and the wrong level of granularity. Google should be free to add and remove email servers at will, while carrying their good reputation with them.

That's what TECS gives you.

Connection Signing

TECS is an extension to the existing SMTP AUTH mechanism (see RFC 2554) and implements an authentication mechanism that I'll refer to as TECS-1 (the 1 here acts as a version number on the protocol). TECS-1 would need to be registered as a SASL (see RFC 2222) authentication mechanism.

When a mail sender connects to an SMTP server wishing to sign its connection it issues the EHLO command and if that SMTP server is capable of handling AUTH the mail sender then signs the connection using the AUTH command, with the TECS-1 mechanism followed by an initial response (which contains the TECS signature) as defined in RFC 2554.

Here's an example session:

S: 220 smtp.example.com ESMTP server ready
C: EHLO jgc.example.com
S: 250-smtp.example.com
S: 250 AUTH CRAM-MD5 DIGEST-MD5 TECS-1
C: AUTH TECS-1 PENCeUxFREJoU0Nnbmh1YWVlMDljNDBhZjJiODRhMGMyYjNiYmFlNzg2ZQ==
S: 235 Authentication successful.

The 'initial response' section of the AUTH command is a base-64 encoded string containing the following structure (this is deliberately similar to the DKIM fields):

a=rsa-sha256; q=dns; d=jgc.org; b=oU0Nnbmh1YWVlMDljNDBhZjJiO==

a= is the cryptographic method used (default would be RSA/SHA-256 with suitable padding as described in PKCS#1 version 1.5 RFC 3447).

d= is the name of the domain signing the connection. In the example above I am showing a connection that is being signed (and hence claimed by) jgc.org.

q= is a query type with the default being the use of a DNS TXT record. This query method is used to obtain the public key associated with the signing domain. The public key would be obtained by looking up _tecs.jgc.org and getting the associated TXT record.

b= is the binary signature for the connection generated using the method in a= by the d= domain.

The connecting server signs the tuple consisting of ( destination IP/port, source IP/port and epoch ); that way they sign the current connection and verify that they are responsible for the mail sent across it.

Each entity has an RSA key public/private key pair. When signing a connection the entity generates a SHA-256 hash of the tuple. The destination IP/port pair is the IP address and port on the mail server that the mail sender is currently connected to; similarly the source IP/port pair is the IP address and port of the connection being used by the
mail sender. The epoch is the standard Unix epoch rounded to the nearest 30 seconds.

The entity making the connection then encrypts the hash with their private key.
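
To make the mechanics concrete, here's a minimal sketch of a client building the TECS-1 initial response, assuming the Python 'cryptography' package. The serialization of the signed tuple below is illustrative only; the proposal doesn't pin down an exact byte format, and the IP addresses are documentation examples.

# Sketch of a TECS-1 client building the AUTH initial response.
import base64
import time
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# In practice the key pair is fixed and the public half is published in DNS
# (the q=dns / _tecs.<domain> TXT record); a fresh key is generated here only
# so the sketch runs on its own.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)

def tecs_initial_response(domain, dst_ip, dst_port, src_ip, src_port):
    epoch = int(time.time()) // 30 * 30     # epoch rounded (down) to 30 seconds
    signed = f"{dst_ip}:{dst_port};{src_ip}:{src_port};{epoch}".encode()
    # RSA/SHA-256 with PKCS#1 v1.5 padding, matching the a= default above.
    signature = private_key.sign(signed, padding.PKCS1v15(), hashes.SHA256())
    fields = (f"a=rsa-sha256; q=dns; d={domain}; "
              f"b={base64.b64encode(signature).decode()}")
    return base64.b64encode(fields.encode()).decode()   # goes after "AUTH TECS-1"

print("AUTH TECS-1 " + tecs_initial_response("jgc.org",
                                              "192.0.2.25", 25,
                                              "198.51.100.7", 43123))

The receiving server would verify the signature with the public key fetched from the d= domain's DNS TXT record and check that the epoch is current.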

Update (February 8, 2007): A number of people have suggested getting the public key from the _domainkey (i.e. DKIM) label of the d= domain. This seems like a good idea since there's no need to reinvent the wheel.

Update (February 9, 2007). A few people pointed me in the direction of the MARID CSV proposal (see CSV). I've addressed this below.

I don't want to sign my outbound SMTP connections
Well don't then. If TECS were implemented you'd probably find you'd have trouble getting your mail delivered, as non-signing (and hence not taking responsibility) would be looked upon very poorly.

I'm a big ISP and I don't want to sign as myself, can I sign as a customer?
Sure, for example, jgc.org's mail is actually handled by lists.herald.co.uk. When that server was sending mail for jgc.org it could sign as jgc.org as long as it has access to jgc.org's private key. It would simply specify d=jgc.org in the TECS data.

Why don't you just use STARTTLS with certificates?
Because that's a very heavyweight system, designed for something else. A SASL extension using SMTP AUTH is simple and clean.

Why do you think connection signing is useful?
Because SMTP server resources are precious. Being able to make a decision about a connection before any mail is delivered is very useful. An SMTP server owner could use reputation data from a public or private source to decide whether to accept or reject a connection, slow down a connection, apply less or extra scrutiny to a connection, etc. Being able to do this before receiving a ton of mail and tying up a server is very valuable.

Won't spammers just sign their connections?
No doubt, but that's hardly a worry, being able to identify the good senders fast is the most important goal.

Why don't you just sign the destination IP/port pair? That's known before the connection is made and avoids problems with NAT
TECS could just sign the tuple ( destination IP, port, epoch ) but I think it's a bit weaker than my proposal. Since the destination IP and port are fixed for a given MTA the signature is really a signature on the time. An eavesdropper could replay the signature within 30 seconds (or whatever timeout is placed on the epoch value) and get an authenticated connection from any source IP address.

What about the MARID CSV proposal?
The CSV proposal is a lightweight (DNS-based), non-cryptographic method of establishing whether a host claiming a certain domain name in HELO/EHLO is authorized to be an SMTP client. Clearly, CSV aims to provide a simple method of determining whether a connecting SMTP client is authorized to be an SMTP client with the claimed name. This seems like a useful extension, but is very different from TECS. TECS operates at the level of a specific connection, and with an entity that is distinct from the domain of the SMTP client. This is valuable for two reasons: it allows the identity to be moved from SMTP service provider to provider, and it means that shared SMTP servers can operate claiming different 'responsible parties' for each connection. This latter point is important for ISPs that provide SMTP services to email marketers where the same SMTP server may be shared across many clients. Sharing can result in a clean emailer being blacklisted because the IP of the shared server was blacklisted because of some other, unrelated misbehaviour.

Whilst CSV is a useful extension which would help with the zombie problem, it does not operate at the connection level, which is where I believe the problem needs to be addressed.

CSV also provides specific services for checking domain names against accreditation services. That is outside the scope of TECS, although the assumption is that such services would exist for TECS-signed connections against the domain name claiming responsibility. The bottom line is that TECS deals with the party responsible for a connection; CSV deals with the party responsible for the server.

What about mailing lists that forward mail?
By signing their connections they take responsibility for the mails they are sending. So mailing lists would need to have appropriate email policies in place for unsubscriptions, and deal with spam to the list themselves. Since the connection is signed, any concerns about munging of From: addresses for VERP handling, or about adding headers/footers to email, are irrelevant.

Is this compatible with SPF, Sender-ID, DomainKeys?
They are orthogonal. There's no direct interaction. Although, it might be sensible to use the _domainkey record from DKIM to obtain a public key thus sharing the same key between DKIM and TECS.

Will this reduce spam?
I'm not going to make any predictions. The goal would be to build a database that makes it easier to recognize someone who is legitimate, and scrutinize those who abuse the system or who choose not to sign.

What about anonymity?
Anonymous remailers are unaffected. They could sign their outbound connections with the system, but that would not affect any changes they make to anonymize messages since it's the connection, not the message content, that's signed.

What if I change the mail servers or IP addresses I am using?
There's no effect. Keep signing the connections and you can take responsibility for any IP address you want to.

I think you are wrong, right, stupid, a genius.
Please comment here, or write to me directly.


Many thanks to all members of the REDACTED discussion forum, and to Toby DiPasquale.

Labels:

Thursday, February 01, 2007

Proposal for connection signing reputation system for email: TECS

IMPORTANT: This blog post is deprecated. Please read Trusted Email Connection Signing (rev 0.2) instead


The motivation behind TECS (Trusted Email Connection Signing) is that what managers of MX servers on the public Internet really care about is the ability to distinguish a good connection (coming from a legitimate sender and which will be used to send wanted email) from a bad connection (coming from a spammer). If you can identify a bad connection (today, you do that using an RBL or other reputation service based on the IP address of the sender) you can tarpit or drop it, or subject the mails sent on the connection to extra scrutiny. If you can identify a good connection it can bypass spam checks and help reduce the overall false positive rate.

Currently, the options used to identify a bad connection are rather limited (RBLs, paid reputation services and grey listing), and good connections are hard to manage (whitelists on a per-recipient basis, or pay-per-mail services). What's needed is a different approach.

There are also ideas like SPF, Sender-ID and DomainKeys which all attack the problem of protecting the integrity of the From: portion of a message.

TECS is different. The idea is to identify and determine the reputation of the entity connecting to a mail server in real-time without resorting to a blacklist or whitelist. This is done by signing the connection itself. With the signature on a per-connection basis a mail server is able to determine who is responsible for the connection, and then look up that entity's reputation in a database.

Current reputation databases are based on IP addresses. This is a very inflexible system: IP addresses must be added to blacklists very fast as spammers churn through zombie machines, and any legitimate emailer needs to make sure their mail servers are whitelisted with multiple email providers (e.g. Yahoo!, Gmail, Brightmail, ...) to ensure delivery. And if a legitimate mailer wants to bring new servers with new IP addresses on line, they have to run through the entire whitelisting process again.

This is inefficient. The mapping between IP addresses and entities (e.g. knowing that Google's Gmail service uses a specific set of IP addresses) is unwieldy to manage and at the wrong level of granularity. Google should be free to add and remove email servers at will, while carrying their good reputation with them.

That's what TECS gives you.

Now for the how. To work TECS requires two things: a reputation authority and an algorithm. Let's start with the second.

Connection Signing

When a mail sender wishing to sign its connection connects to an SMTP server, it issues the EHLO command and, if that SMTP server is capable, a new extension command, TECS, will be available. After the EHLO the mail sender then signs the connection using the TECS command.

The TECS command has two parts: an identifier (the unique identifier of the entity signing the connection, and thus taking responsibility for the messages sent across the connection) and a signature.

Each entity has an RSA public/private key pair. When signing a connection the entity generates a SHA-256 hash of the tuple ( destination IP, destination port, source IP, source port, epoch ). The destination IP/port pair is the IP address and port on the mail server that the mail sender is currently connected to; similarly the source IP/port pair is the IP address and port of the connection being used by the
mail sender. The epoch is the standard Unix epoch rounded to the nearest 30 seconds.

The entity making the connection then encrypts the hash with their private key, turns that into a hex string and uses that string as the second parameter to the new SMTP TECS command.

For example, an entity with the unique identifier 1b46ef3d might sign a particular connection like this:

TECS 1b46ef3d 5dde82a341863c87be1258c02ce7f80bf214192b

to which the receiving server could reply 200 OK if the signature is good (which it verifies by generating the same hash and decrypting the signature using the entity's public key), or with an error if the signature is bad (in which case it should probably drop the connection).
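In Python (using the cryptography package; the tuple serialization here is just an illustrative choice that both ends would have to agree on) the check on the receiving side might look something like this:

# Sketch only: verify a TECS signature given the entity's public key
# (obtained from the reputation authority as described below).
import time

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding

def tecs_verify(public_key, hex_signature, dst_ip, dst_port, src_ip, src_port):
    # Rebuild the same tuple and epoch used by the sender.
    epoch = int(round(time.time() / 30.0) * 30)
    data = f"{dst_ip}:{dst_port}|{src_ip}:{src_port}|{epoch}".encode()
    try:
        public_key.verify(bytes.fromhex(hex_signature), data,
                          padding.PKCS1v15(), hashes.SHA256())
        return True   # signature good: treat the connection as signed
    except InvalidSignature:
        return False  # signature bad: reply with an error, probably drop

A real implementation would presumably also allow for a little clock skew by trying the neighbouring 30-second epoch values.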

To get the entity's public key the receiving server needs to query the reputation authority.

Reputation Authority

The TECS reputation authority would be a non-profit organization that sells public/private key pairs and allocates entity IDs to verified entities. Money gathered from selling keys would be used to maintain the database of reputation information for each entity, and to ensure that only reputable entities can obtain keys.

In the example above the receiving server would query the DNS TXT record of the domain name produced by concatenating the identifier given in the TECS command with the name of the authority. Suppose that the authority was tecs.jgc.org; then a DNS TXT query would go to 1b46ef3d.tecs.jgc.org.

The reply would consist of the ASCII-armored public key for that entity and a reputation measure indicating the reliability of that entity. The reputation measure would take one of four states: unknown (a recently issued key would not have any reputation), good (only a small number of complaints against this ID), medium (some complaints), or bad (a large number of complaints, probable spam source). The receiving server can verify the signature and use the reputation information to decide on the handling of the connection.
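As a sketch (using the dnspython package, and assuming, purely for illustration, that the TXT record packs the reputation and the key into one string), the lookup might be:

# Sketch only: query the reputation authority for an entity's public key
# and reputation. The "<reputation> <ASCII-armored key>" record layout is
# my own assumption for illustration; requires dnspython.
import dns.resolver

def lookup_entity(entity_id, authority="tecs.jgc.org"):
    answer = dns.resolver.resolve(f"{entity_id}.{authority}", "TXT")
    record = b"".join(answer[0].strings).decode()
    reputation, _, public_key = record.partition(" ")
    return reputation, public_key

# e.g. reputation, key = lookup_entity("1b46ef3d")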

The authority would accept ARF formatted complaints consisting of abusive messages giving connection information, and the full text of the TECS command. They would then investigate to ensure that the reputation database contained up to date and useful information.

How much is a key pair going to cost?
I think it should be cheap for individuals ($25?), fairly cheap for non-profits and charities ($100?), and then a sliding scale for for-profit companies based on size (say $100 for a small company, $1000 for a big one?). The goal would be to make enough money to run the list.

What about mailing lists that forward mail?
By signing their connections they take responsibility for the mails they are sending. So mailing lists would need to have appropriate email policies in place for unsubscriptions, and deal with spam to the list themselves. Since the connection is signed, any concerns about munging of From: addresses for VERP handling, or about adding headers/footers to email, are irrelevant.

Is this compatible with SPF, Sender-ID, DomainKeys?
They are orthogonal. There's no direct interaction.

Will this reduce spam?
I'm not going to make any predictions. The goal would be to build a database that makes it easier to recognize someone who is legitimate, and scrutinize those who abuse the system or who choose not to sign.

What about anonymity?
Anonymous remailers are unaffected. They could sign their outbound connections with the system, but that would not affect any changes they make to anonymize messages since it's the connection, not the message content, that's signed.

What if I change the mail servers or IP addresses I am using?
There's no effect. Keep signing the connections and you can take responsibility for any IP address you want to.

I think you are wrong, right, stupid, a genius.
Please comment here, or write to me directly.

Labels:

Friday, January 19, 2007

SpamOrHam shut down

Today I shut down SpamOrHam after a total of 357,380 messages were examined by volunteers around the world in 9 months. (If this is the first time you are hearing about SpamOrHam then read this).

At the same time, I'm happy to announce that the associated competition was won by Alan Wylie in the UK. He clicked through 456 messages in a row without once disagreeing with the corpus gold standard classification. His prize is on its way to him today.

As promised I'm happy to make the data associated with SpamOrHam public. If you would like to get the details of all 357,380 classifications and how they match up to the TREC 2005 data please email me and I'll point you to the download location.

Finally, thanks to everyone who participated in SpamOrHam. I look forward to being able to report on the results of the experiment at a later date, and anyone who wishes to do their own analysis can simply ask for the raw data by email.

Labels:

Sunday, December 03, 2006

Two weeks of image spam innovation

Since I last blogged about image spam I've received numerous image spams myself, and reports from Nick FitzGerald, Sorin Mustaca and Nicholas Johnston about interesting image spams that I've just got to see.

On November 14, I received a report of image spams where the background noise had been updated from dots or small lines to polygons. Here's a sample:



A day later it was clear that the same spammer was randomizing the image size, the background noise and the color palette used:



Then on November 17 I was shown this interesting pump'n'dump technique:


83622874400056543 047183602660 41478311028418 100278 84807407350 05087016712772
78810870435635016 71651855827222 4725576405300038 84252840 12157351038885 630188325737443
23414 23133 41104 4312 7131 6341874402 48244 02522 1428 3224
55263 16021 42114 6654 7782 6583 2673 53280 03201 8323 6565
65882 58041 62412 6086607534050781 3826 0641 83176 85045 35427866406418
18506 15283 12474 33388136436533 643542 80456 87628 13156 506577110786230
31645 55602 81036 816525232877 301217585451283 01540 76265 0748 3312
06035 63450 43040 5224 1553 5786434504117661 52402 78534 8836 414
58510 75013 38325 3877 04146 76870 2788 34488 42803 77840 2737 5470
11372 62751178216028 2715 7266 17812 3668 10033 28833425366310 360215874450816
31403 486020746152 7723 4552 46471 6721 77207 683820875035 23460162005106

Nice, but that didn't last very long; on November 22 the innovators were at it again with smaller images containing polygons, lines, random colors and jagged text:



A day later the noise element changed to pixels:



Two days later spammers were trying a little 'old school' image spam with some fonts that they hoped would be hard to OCR.



Strangely the following image spam appeared on November 27 with a perfectly filterable URL. Oops:



And right before the month ended the noise around the border had turned into something like little fireworks:

Labels:

Thursday, November 16, 2006

Yet more spammer image optimization; this time it's pretty

There's something new in the image-spam wave: pretty colours! This spammer is working hard to randomize his images and avoid OCR. Here's a sample:



And to give you an idea of the randomization here's another:



Thanks to Nick FitzGerald and Sorin Mustaca for samples. Notice how the letters are misaligned both vertically and horizontally to try to avoid OCR, and the background polygons are randomized. Also the aspect ratio and size of the messages have been changed for each image.

Labels:

Wednesday, November 08, 2006

Ransom note spam

Back in January I added a trick called The Small Picture to The Spammers' Compendium, and in August I updated The tURLing Test trick with an example of its use in image-based spam.

The Small Picture consists of sending individual letter images attached to a message. These letter images are then used to display a message and break up words that the spammer might think a spam filter would find suspicious. Here's an example of The Small Picture where certain letters (look carefully!) are formed using images rather than text:



The tURLing Test consists of disguising a URL by breaking it up and then explaining to the user how to type in the URL, thus proving that a human is reading the spam not a spam filter. This is done with URLs so that URL blacklists are bypassed. Here's an example of that from an image-based spam:



Now comes a combination of the two, that deserves the name 'Ransom Note Spam': it combines both The Small Picture (the letters are individual images attached to the spam) and The tURLing Test (the URL is made up of letters in the images):


Labels:

Friday, October 20, 2006

Why OCRing spam images is useless

Nick FitzGerald forwards me another animated GIF spam that takes the animation plus transparency trick I outlined in the blog post A spam image that slowly builds to reveal its message to a new level. And it shows why spammers will work around OCR as fast as they can.

Here's what you see in the spam image:



Looks simple enough until you take a look at the GIF file that actually generated what you see. It's animated and it has three frames:





The first image is the GIF's background and is displayed for 10 ms; then the second image is layered on top with a transparent background so that the two images merge together and the image the spammer wants you to see appears. That image remains on screen for 100,000 ms (or 1 minute 40 seconds). After that the image is completely blanked out by the third frame.

My favourite touch is that it's not the entire image that's transparent, not even the white background, but just those pixels necessary to make the black pixels underneath show through. If you look carefully above you can see that some of the pixels appear yellow (which is the background color of this site) indicating where the transparency is.

That is darn clever.
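If you want to pull one of these apart yourself, Pillow makes it easy to dump the frame delays and flatten the animation the way a mail client would. A rough sketch (the filename is a placeholder, and I'm assuming each frame covers the full canvas):

# Sketch only: list an animated GIF's frame durations and flatten the frames,
# respecting transparency, to recover the image the spammer wants you to see.
# Requires Pillow; 'spam.gif' is a placeholder filename.
from PIL import Image, ImageSequence

im = Image.open("spam.gif")
flattened = None
for i, frame in enumerate(ImageSequence.Iterator(im)):
    duration = frame.info.get("duration", 0)  # per-frame delay in milliseconds
    print(f"frame {i}: {frame.size}, displayed for {duration} ms")
    rgba = frame.convert("RGBA")
    # Layer each frame over what came before, just as the mail client does.
    flattened = rgba if flattened is None else Image.alpha_composite(flattened, rgba)

flattened.save("flattened.png")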

Labels:

Monday, October 16, 2006

A spam image that slowly builds to reveal its message

Nick FitzGerald sent me a stunning example of lateral thinking on the part of a spammer. The spammer has taken a standard stock pump-and-dump spam image and split it horizontally into strips.



Each of the 17 horizontal strips cuts fairly randomly through the text making OCR on each strip not very useful. The spammer has then mounted each strip in its correct position on a transparent background and put each strip into an animated GIF. Here, for example, are a couple of strips:




The end result is that only once the entire animation has completed is the complete spam visible, making this a challenge for spam filters. And the spammer has thrown in a couple of frames at the end of the image that get displayed after such a long delay (8 minutes) that they essentially never get shown. Those final frames are there just to throw off a spam filter trying to find the actual image.

Here's what gets displayed:



and here's the final image in the animation:



Very clever! (I'm calling this 'Strip Mining')

Labels:

Monday, October 02, 2006

Ye Olde OCR Buster

Regular spam-correspondent Nick FitzGerald writes with an example of a spam that he believes is trying to get around both hash busting and OCR in an image.



The image has random dots in the bottom left hand corner to mess up hashing of the GIF itself, and the fonts used are badly rendered unusual fonts.

Labels:

Wednesday, September 20, 2006

Watching a phishing attack live

Yesterday a phishing mail for a community bank in a US east coast state (throughout this blog post I have obscured many details including names, domains and IP addresses) slipped through GMail's spam/phish filter and then right through POPFile. Only Thunderbird bothered to warn me that it might be a scam.

The message itself was sent from an ADSL connected machine in China.

Of course, since I don't have an account with this bank it was an obvious phish, but I was curious about it so I followed the link in the message.

The link appeared to go to https://*****bank.com/Common/SignOn/Start.asp but actually went to http://***.***.164.158:82/*****bank.com/Common/SignOn/Start.html. Clearly a phish running on a compromised host.

A reverse DNS lookup on the IP address of the host revealed that the phish was being handled by a web server installed in a school in a small central Californian town. The machine appeared to be running IIS, but the phishing server identified itself (on port 82) as Apache/2.0.55 (Win32) Server.

The Start.html page was identical to the actual sign-on page used by the bank. In fact, comparing a screen shot of the real page with a screen shot of the phishing page revealed that they were identical. Even the MD5 checksums of the images were the same. Naturally, not everything was the same in the HTML.

Although almost all the HTML was identical (with the phishing site even pulling its images off the real bank's site), the name of the script that handled validation of the user name and password had been changed from SignOn.asp (the actual bank uses ASP) to verify.php (the phisher used PHP).

The only significant diff between the phisher site and the real site is:

272c272
< <form action="verify.php" method="post" id="form1" name="form1">
---
> <form action="SignOn.asp" method="post" id="form1" name="form1">

Once a username and password was entered the phishee was taken to a page asking for name, email address, credit card number, CVV2 number and PIN (with the PIN asked for a second time for validation). After that the user was thanked for verifying their details.

The user name, password, credit card number, CVV2 number and PINs were saved to a file called red.txt in the same directory as the HTML and PHP files used to make the phishing site. How do I know that? Simple, by popping up one level in the phishing URL to http://***.***.164.158:82/*****bank.com/Common/SignOn/ I was able to get a directory listing. In the directory there were three HTML files, two PHP scripts and red.txt. Clicking on that file gave me access to the phished details as they came in.

I quickly informed the bank and US CERT of the phishing site. I tried to figure out how to contact the school, but it was 0500 in California.

Here's a sample entry from the actual log file.

###################################
Tue Sep 19, 2006 5:33 am
Username: youare
Password: stuipd
***.***.118.70
###################################
Tue Sep 19, 2006 5:34 am
cc: 4111111111111111
expm: 10
expy: 2006
cvv: 321
pin: 1122
pin2: 1122
***.***.118.70
###################################

The time is local to California and you can see the details that the person entered. Here a vigilante has clearly decided to mess with the phisher by entering bogus details. In fact, the last time I was able to access the site (before it was pulled down) there were 33 entries in the log file. Of these, 32 contained nothing or offensive user names and passwords.

But one seemed to contain legitimate information.

The log file had its first entry at 0454 California time from a machine owned by MessageLabs (I assume that they are doing some automated testing of phishing sites); the last entry was at 1226 California time.

The one legitimate entry contained a valid Visa card number (valid in the sense that the number validated against the standard Luhn check digit algorithm). Also the user name and password looked legitimate and a quick Google search revealed that the username was also used as part of the email address of a small business in the same town as one of this small bank's branches. It looked very likely that this entry was legitimate and the person had given away their real card number and PIN.
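For the curious, the Luhn check takes only a few lines of Python (a sketch; the number below is the well-known test number the vigilante typed in above, not the real one):

# Sketch only: the standard Luhn check digit algorithm. It only proves the
# digits are self-consistent; it says nothing about whether the card exists.
def luhn_valid(number: str) -> bool:
    digits = [int(d) for d in number if d.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

print(luhn_valid("4111111111111111"))  # True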

US CERT quickly responded with an auto-response assigning me an incident number and I received an email from the bank's IT Ops Manager Jack. Jack told me that he was already aware of the site and that this was the third time this little bank had been phished from machines in California and Germany. I gave Jack the name of the school in California, and he said he'd get in contact with them (he'd already called the FBI). I also told Jack about the one card number that looked totally legitimate; he told me he was in charge of all card operations at the bank and had the power to deal with it.

Some hours after that the site went offline.

Labels:

Friday, September 15, 2006

Image spam filtering BOF at Virus Bulletin 2006 Montreal

I'm leading a BOF meeting at Virus Bulletin 2006 in Montreal next month. The idea is to get together in one room for a practical, tactical meeting to share experiences on how people are currently filtering image spam and what might be done in future (and what we expect spammers to do). I've already got commitments from major anti-spam vendors to be there and talk (as much as they are permitted) about their approach and I'll try to cover what the Bayesian guys are doing.

If you are interested please email me, or post a comment here. If you represent a vendor and want to be involved I'm especially interested to hear from you as I want to get all experiences out on the table (as much as is practical).


Date and Time Confirmed: Thursday, October 12. 17:40 to 18:40 in the Press Room.

Downloadable PDF flyer.

Labels:

Wednesday, September 13, 2006

Apologia: Sophos and SoftScan

After reading all the blog posts, mailing list and personal mail concerning my post yesterday (Did SoftScan, Sophos and Panda rip off my blog?) I think I need to apologize to two of the companies involved.

As I mention in the updated post, both SoftScan and Sophos explain that it's a coincidence and since I have no evidence that they copied stuff from this blog (even though it appeared on the front page of Slashdot before their PR), I think I owe them an apology. It probably would have been prudent of me to restrict yesterday's posting to just Panda and ignore SoftScan and Sophos.

*sigh*

*bows head in shame*

Labels:

Tuesday, September 12, 2006

Did SoftScan, Sophos and Panda rip off my blog? (Update: SoftScan and Sophos says 'no')

This morning I saw a news article about subliminal spam messages on ZDNet. I was intrigued to read about it because a few days ago Nick FitzGerald wrote to me with an example spam that he dubbed 'subliminal'. I wrote back and told him I was going to blog about it and he said go ahead.

The blog post is Subliminal advertising in spam? and was posted on Monday, September 4, 2006. That same day Slashdot picked up my blog post here. Later it was also picked up by Digg.

So I was a little surprised that the ZDNet article didn't mention Nick, me, my blog, Slashdot, or Digg. In fact, the article contains a link to Panda's press release on the subject: PandaLabs detects a new spam technique in which they state "PandaLabs has detected a spam message that uses subliminal advertising techniques.". No mention of this blog anywhere there either, but there are two images of such a spam, both of which I believe were lifted directly from my blog without attribution. The press release is dated the day after my post/Slashdot headline: Tuesday, September 5, 2006.

Here are the images side by side for comparison


Image from my blog post


Image from Panda's press release (local archive of the image)

And I named my image sub2.gif when I extracted it from the spam, and Panda named the same image sub2.gif. The MD5 checksum of my image is 9cace353b2d8b2db1d8868c07986f768 and the Panda image has the checksum 9cace353b2d8b2db1d8868c07986f768. And I also thought the original was a bit large for my blog so I reduced it from 603x451 to 302x226; the Panda image has the same reduced dimensions. Hmm. Exactly the same image.

The other image in the press release is also, I believe, from my blog:


Image from my blog post


Image from Panda's press release (local archive of the image)

Once again, I named my image sub3.gif when I extracted it from the spam, and Panda named the same image sub3.gif. The MD5 checksum of my image is 6e16df2d3b67a7578ca7b09f0ccb9fc1 and the Panda image has the checksum 6e16df2d3b67a7578ca7b09f0ccb9fc1. Again I thought the original was a bit large for my blog so I reduced it from 603x451 to 302x226; the Panda image has the same reduced dimensions. Hmm. Exactly the same image, again.
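If you want to verify that kind of claim yourself, computing the checksums takes a couple of lines of Python (the filenames here are placeholders):

# Sketch only: compute the MD5 checksum of two image files so they can be compared.
import hashlib

def md5sum(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

print(md5sum("sub2.gif"))        # my copy
print(md5sum("panda_sub2.gif"))  # the copy from the press release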

So it looks a lot to me like Panda heard about my blog post (perhaps through Slashdot) and then passed Nick's example off as their own research. Of course, it's possible that Panda, the day after my blog post, independently found the same thing, named it subliminal spam, named the frames within the GIF the same thing as me, extracted them from exactly the same spam image (which they managed to capture even though spammers are adding random noise so that hashing is impossible), and issued their press release.

On Wednesday, September 6, 2006 (two days after my blog post/Slashdot headline) Sophos put out a press release Spammers use subliminal messages in latest pump-and-dump scams in which they state: "Experts at SophosLabs™, Sophos's global network of virus, spyware and spam analysis centers, have identified a "pump-and-dump" stock spam campaign which uses an animated graphic to display a "subliminal" message to potential investors."

Once again the release doesn't mention me, Nick, this blog, Slashdot, Digg, ... It too includes an image that appears to be from the same spam campaign I was blogging about (a pump and dump for the stock TMXO), but there's no image borrowing here. The image is from the same campaign but different, and they no doubt didn't borrow any images from me.

Clearly, Sophos could have seen the same spam campaign as Nick and I and come to the same conclusion and called it 'subliminal' spam.

On Thursday, September 7, 2006 it appears that SoftScan got into the game too. They are mentioned in this article where it's written: "SoftScan's analysis of the latest pump-and-dump scam has discovered that an image appears for a split second every so often in the email with the word 'buy' repeated several times."

Disclaimer: I can't prove that any of these companies saw my blog post on Slashdot and then issued press releases, but the timing is interesting: my blog post comes first, followed by press releases and articles using either the same images or the same campaign, and all calling it 'subliminal spam'. Perhaps 'subliminal' spam was an obvious name, and I'm crazy, but...

An offer: on the other hand, if any company would like free rein to pass off things on my blog as their own work I have a simple offer for you: give me a small stock option in your company, call me a 'technical advisor' or similar, and feel free to take what you want from here.

UPDATE: SoftScan's Corporate Communications Manager Bo Engelbrechtsen comments below (see comments section) that they independently found this, and had never heard of this blog before.

UPDATE: In a private email a Sophos employee I know well says: "I personally alerted Sophos's PR team about this spammer trick [...] The word "subliminal" was the first thing that came to my mind when I saw it. [...] I don't read John's blog and am very disappointed with this insinuation. We receive millions of spam e-mails to our traps every day, many of which get analyzed and looked at by spam analysts around the world. We don't need to steal someone else's story..."

Labels: