Monday, May 15, 2006

There's one born every minute: spam and phishing

It's been a little while since I launched a site where people perform a spam filtering task and their results are compared against best-of-breed spam filters. I set out to make sure that the spam filters were doing a good job, on the assumption that people would be able to spot errors that the filter was making.

Bad assumption. It turns out, based on preliminary data, that people suck at spam filtering. Here are some initial figures: people agree with 89.1% of the classifications that they've examined. Now that could mean that the original spam filter sucked, but guess again!
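As a rough sketch (not the code behind the site), an agreement figure like the one above could be computed as follows, assuming votes are stored as (message id, vote) pairs and the filter's verdicts as a dictionary. The function names and the sample data are made up for illustration. The second function picks out messages that several people saw and unanimously judged differently from the filter.

```python
from collections import defaultdict

def agreement_rate(votes, filter_labels):
    """Fraction of individual votes that match the filter's label."""
    matches = sum(1 for msg_id, vote in votes if vote == filter_labels[msg_id])
    return matches / len(votes)

def unanimous_disagreements(votes, filter_labels, min_votes=2):
    """Messages seen by at least `min_votes` people who all contradict the filter."""
    by_msg = defaultdict(list)
    for msg_id, vote in votes:
        by_msg[msg_id].append(vote)
    return sorted(
        msg_id
        for msg_id, seen in by_msg.items()
        if len(seen) >= min_votes
        and len(set(seen)) == 1            # every voter agreed with the others
        and seen[0] != filter_labels[msg_id]  # ...but not with the filter
    )

# Hypothetical data: the filter's verdicts and a handful of human votes.
filter_labels = {"m1": "ham", "m2": "spam", "m3": "ham"}
votes = [("m1", "spam"), ("m1", "spam"), ("m2", "spam"),
         ("m3", "ham"), ("m3", "spam")]

print(agreement_rate(votes, filter_labels))
print(unanimous_disagreements(votes, filter_labels))
```

Messages with a single vote, or with split votes, are left out of the second list because one stray click is not much evidence either way.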

Ignoring all the emails that have only been voted on once, and looking at the emails that have been seen by multiple people (who've agreed that they believe that the message is a ham or a spam), there are some really surprising results:

Here's one that people think is a spam:

and this one too:

and many people think this US Airways message is spam:

Now for the prize winning classification. The people who thought the following phish was a genuine message, could you please forward your bank account details and PIN to me so that I can deposit your prize in your account:

Happily, people are finding genuine errors that the spam filter made. For example, this really is a genuine message from Travelocity and not a spam:



Anonymous Michael Clark said...

It would be cool if on SpamorHam I could see how many decisions I made matched the original classification.

The Captcha isn't working. :(

2:53 PM  
Anonymous Michael Clark said...

never mind, the captcha works, and the system is telling me my stats. Cool.

2:54 PM  
Anonymous Anonymous said...

I've got to say, that the judgement of some of those emails as "not spam" is suspect. I've gotten mail from sites that is "legitimate" in the sense that it was actually sent by the company in question, but it's spam since it's an unsolicited advertisement, from an opt-out type of mailing list.

3:38 PM  
Anonymous Anonymous said...

These messages are all Spam....

3:48 PM  
Anonymous Anonymous said...

There should be a method to 'take back' a classification once you submit it.

I was going through the site for the first time today, and one of the messages lagged. I thought I didn't hit submit - so I re-clicked, and incorrectly categorized a message.

I actually ended up doing that a couple times - realizing after I went to the next one.

3:53 PM  
Anonymous Anonymous said...

Erm, if I was to receive any of those, I would consider all of them spam and I would be correct.
Why? Because I had no prior dealings with any of those individuals or companies (except Paypal, but that one was given away by the URL).
I'd assume many ppl taking the survey had the same thought.
That is why a spam filter must be trained.

3:56 PM  
Anonymous Anonymous said...

The trouble with these examples is that for some people these are spam - and for others they are not. If I do a lot of business with Travelocity - then that last message is a useful, genuine offer - and not spam - however, if I've never dealt with them before and have no interest whatever in their offer then it's irrelevant whether it really came from them or not - it's most definitely spam when it's sent to me. That was true of nearly every example you came up with (except maybe the blatant phishing "you have won a prize" example).

Even your first example (which appears to be a perfectly legitimate business memo) might be an attempt to find out whether my email address is "live" or not so I can be added to some widely sold spam list. The only way to know is to ask whether I know these people - and whether what they are talking about is relevant to my work. For me PERSONALLY, it's obviously spam because I don't know any of the people who are talking. So, yes, if I got this email, I'd label it spam in a heartbeat. However, if the people mentioned in the email were known to me - then I'd say it was obviously genuine. Using that example in your survey without making it clear that you actually know these guys makes the responses you got back meaningless.

This is the problem for spam filtering - it can't be done properly without intervention from the end user. The Bayesian filtering approach (e.g. as used in Mozilla/Thunderbird) is fairly effective - for me, it gets rid of at least 90 to 95% of spam and I've yet to see a false positive from it after using it for nearly 2 years.
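To make the "trained by the end user" point concrete, here is a minimal naive Bayes sketch. It is a generic textbook version with add-one smoothing, not the algorithm Mozilla/Thunderbird actually ships, and the class name, tokenizer and training messages are assumptions for illustration only.

```python
import math
import re
from collections import Counter

class NaiveBayesFilter:
    """Toy per-user filter: it only knows what this user has labelled."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.msg_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9$']+", text.lower())

    def train(self, text, label):
        self.msg_counts[label] += 1
        self.word_counts[label].update(self.tokens(text))

    def spam_probability(self, text):
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_msgs = sum(self.msg_counts.values())
        log_scores = {}
        for label in ("spam", "ham"):
            # Smoothed class prior, so an untrained class never gives log(0).
            score = math.log((self.msg_counts[label] + 1) / (total_msgs + 2))
            n_words = sum(self.word_counts[label].values())
            for tok in self.tokens(text):
                # Add-one smoothing for words unseen in this class.
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (n_words + len(vocab) + 1))
            log_scores[label] = score
        # Turn the two log scores into P(spam | text).
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])

f = NaiveBayesFilter()
f.train("cheap viagra buy now", "spam")
f.train("win a prize now", "spam")
f.train("meeting about crude oil pricing", "ham")
f.train("credit decision for the customer", "ham")
print(f.spam_probability("buy cheap viagra") > 0.5)
print(f.spam_probability("crude oil meeting") < 0.5)
```

The same message can score very differently for two users, because each user's training labels encode who they know and what they signed up for.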

4:04 PM  
Anonymous Anonymous said...

Your point being?

To me, all five messages you quoted are spam. Because I have no idea who the f*ck is Alice... er... the guy you're discussing credit decisions with (My rule: anything about credit over the phone ONLY!), ALL jokes received in email are SPAM no matter whom they came from (thank you, I get all my jokes on humor websites, no need to clog my mailbox with them), and any stuff from companies I didn't explicitly request is SPAM (and I do not shop at Travelocity). Therefore, one man's treasure is another man's trash. :-b

4:06 PM  
Blogger Patrick McKenzie said...

Hideho. I classified 30 messages for you the other day, and one thing I wondered was "What is the context of this message in this inbox?" For the first dozen or so it wasn't obvious that the (a?) inbox was from Enron, so messages about pricing levels for crude oil were solidly topical and that bit about credit for the customer was not. In my own inbox, the .5 second visual inspection I give to a new mail to decide whether to read it or not would have trashed either of these subjects because they have no legitimate purpose *to me*. It might be a weakness of this distributed-classification approach.

4:10 PM  
Blogger Flamsmark said...

you seem to have a pretty interesting definition of spam. i'd classify many of those messages as spam if they arrived in my inbox, because they're unsolicited, and not useful to me.

4:21 PM  
Anonymous Anonymous said...

As far as I am concerned, if I receive any message dealing with advertising of any sort... I consider it spam. Even if you are getting a deal.

4:29 PM  
Anonymous blindcoder said...

Spam detection is based heavily on the receiver.
If I get a mail claiming to be from an american bank I can happily delete it since I am not a customer of an american bank.
An american citizen might want to check twice.
Same goes for german Postbank-Scams. I am not a customer there so 'd' goes the mail.

OTOH, I am a paypal customer. If I get a mail claiming to be from paypal I'm going to have a look at it. Someone else will delete it unread.

I'd have simply deleted the E-Saver mail, too.

4:30 PM  
Anonymous Anonymous said...

I'm not sure why you're saying the Travelocity or US Airways emails are not spam.

I'm aware of the distinction you're making: they're not blast emails targeting millions for unwanted penis enhancers.

However, they still are advertisements freeloading on my purchased bandwidth. Why should I NOT consider them spam?

4:34 PM  
Anonymous Anonymous said...

It might be genuinely from Travelocity, but in my book, it's still SPAM.

4:37 PM  
Anonymous Anonymous said...

One of the big issues is just what people think the definition of "spam" is. Is it ANY commercial email advertisement? Is it just unsolicited email? There are lots of (acephalic?) people out there who love to be on the Fruit-of-the-Week club's weekly fruitymailvert. They're annoying, but possibly not unsolicited. Yet, without knowing the context, a third party may not properly assign this as "not spam." Without knowing the recipient's "opt in" status, it looks like spam.

Anyway, most spam filters are smart. Most people are not. With that, we can agree.

4:46 PM  
Anonymous Anonymous said...

I'm sorry for the people who think this is SPAM, this is SCAM.

SPAM is just e-mail trying to SELL you something, these messages want your information so they can STEAL your money.

4:55 PM  
Anonymous Anonymous said...

Here is my low-tech manual spam filter:

1. If it's not a response to an E-Mail that I sent somewhere, it's spam, and even responses to E-Mails I've sent somewhere are sometimes too gratuitous.

2. If the message contains images and/or HTML, it is most likely spam (or some forwarded joke mail I don't really want to read).

3. If my name or E-Mail address appear in the subject line, it's most likely spam.

4. If there is more than one spelling error in the subject line, the message is probably spam. Or else my friends/family/business associates are dull and I'm not going to have fun trying to decipher the body of the message.
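The four rules above translate fairly directly into code. This is only a sketch under several assumptions: `sent_ids` is a hypothetical set of Message-IDs of mail I have sent, `known_words` is a stand-in for a real spelling dictionary, and "is a response" is approximated by the `In-Reply-To` header.

```python
import re
from email.message import EmailMessage

def manual_filter_reasons(msg, my_name, my_address, sent_ids, known_words):
    """Return the list of rules a message trips; an empty list means it passes."""
    reasons = []
    subject = str(msg.get("Subject", "")).lower()

    # Rule 1: not a response to something I sent.
    if str(msg.get("In-Reply-To", "")).strip() not in sent_ids:
        reasons.append("not a reply to mail I sent")

    # Rule 2: contains images and/or HTML.
    if any(part.get_content_type() == "text/html"
           or part.get_content_maintype() == "image"
           for part in msg.walk()):
        reasons.append("contains HTML or images")

    # Rule 3: my name or address appears in the subject line.
    if my_name.lower() in subject or my_address.lower() in subject:
        reasons.append("my name or address is in the subject")

    # Rule 4: more than one spelling error in the subject line.
    words = re.findall(r"[a-z]+", subject)
    if sum(1 for w in words if w not in known_words) > 1:
        reasons.append("more than one misspelling in the subject")

    return reasons

msg = EmailMessage()
msg["Subject"] = "Alice, chepa medz for you"
msg.set_content("<b>hi</b>", subtype="html")
print(manual_filter_reasons(msg, "Alice", "alice@example.org",
                            sent_ids=set(), known_words={"alice", "for", "you"}))
```

Returning the reasons rather than a bare yes/no keeps the "most likely" and "probably" in the rules visible, so a caller can decide how many tripped rules are enough.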

4:59 PM  
Anonymous Scott Frazer said...

The Travelocity emails aren't spam because you have to specifically opt-in to them. I receive them, because I find them useful.

If you get those and don't want them, you should follow the unsubscribe instructions.

It makes sense to _not_ follow the unsubscribe for some email (as it will just confirm the address) but most reputable companies will honor removal requests.

5:11 PM  
Blogger Edward Lee said...

People laugh at me for using PINE, but by looking at the actual text of an HTML email, it's easy to check the destination of questionable links.
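The same check can be automated. The sketch below, using only Python's standard library, lists links whose visible text looks like a hostname or URL but whose real destination is a different host. The class and function names are made up for illustration, and the heuristic is deliberately crude.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class LinkAuditor(HTMLParser):
    """Collects (visible text, href) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

def suspicious_links(html):
    """Links whose text names one host while the href points at another."""
    auditor = LinkAuditor()
    auditor.feed(html)
    flagged = []
    for text, href in auditor.links:
        # Only compare when the visible text itself looks like a host or URL.
        if "." not in text or " " in text:
            continue
        shown = urlparse(text if "://" in text else "http://" + text).hostname
        if shown != urlparse(href).hostname:
            flagged.append((text, href))
    return flagged

sample = ('<a href="http://203.0.113.7/login">www.paypal.com</a> '
          '<a href="http://www.example.com/">www.example.com</a>')
print(suspicious_links(sample))
```

A link labelled "click here" is not flagged by this heuristic, so it complements reading the raw text rather than replacing it.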

5:12 PM  
Anonymous Anonymous said...

What would be really interesting is if the anti-spam filter could see outgoing email too, so that when replies come in they can be filtered using the serial numbers, just the way email clients do threading.

This would be based on my having a relationship with whoever is sending me the email.
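A minimal sketch of that idea, assuming the "serial numbers" are the `Message-ID` headers that mail clients use for threading via `In-Reply-To` and `References`. The `ReplyTracker` class is hypothetical, and a real filter would also have to persist and expire the recorded IDs.

```python
from email.message import EmailMessage

class ReplyTracker:
    """Remembers the Message-IDs of outgoing mail to recognise replies."""

    def __init__(self):
        self.sent_ids = set()

    def record_outgoing(self, msg):
        msg_id = msg.get("Message-ID")
        if msg_id:
            self.sent_ids.add(str(msg_id).strip())

    def is_reply_to_me(self, msg):
        # Both headers may carry IDs of earlier messages in the thread.
        refs = (str(msg.get("In-Reply-To", "")) + " "
                + str(msg.get("References", ""))).split()
        return any(ref in self.sent_ids for ref in refs)

tracker = ReplyTracker()

outgoing = EmailMessage()
outgoing["Message-ID"] = "<abc123@example.org>"
tracker.record_outgoing(outgoing)

reply = EmailMessage()
reply["In-Reply-To"] = "<abc123@example.org>"
print(tracker.is_reply_to_me(reply))

stranger = EmailMessage()
stranger["Subject"] = "You have won a prize"
print(tracker.is_reply_to_me(stranger))
```

Note that a match only shows the sender knew one of my Message-IDs, so it is better treated as strong evidence of ham than as proof.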


5:14 PM  
Anonymous Anonymous said...

What's missing is context - I know what marketing emails I'm subscribed to.

Also, if I see an email like the first, if it's MY mailbox, if I sent the thing that the 'spam' is in response to, then it's not likely that I'll think it's spam.

5:23 PM  
Anonymous Anonymous said...

There will always be some ways to fool most smart spam filters...


5:23 PM  
Anonymous Anonymous said...

Any such message that gets to my inbox is spam because they all are in english.
As simple as that.

And even in the case that they were sent in the proper language they would still be spam because I don't know any of the senders or have any relationship with those companies.

Travel offers? pure spam, no matter how you look at it unless you have explicitly requested that info.

5:25 PM  
Anonymous Anonymous said...

I think it depends a lot on the user. The Travelocity one, for instance, is clearly directed at Travelocity customers, and when they signed up they probably ticked (or didn't untick) an option to okay such emails. So, while it's annoying, it's not unsolicited, so not really spam.
I do consider such emails very annoying, and get especially annoyed when I saw nothing about it when signing up to the company. But most companies like that do offer ways to stop receiving the emails, and that is a better and more reliable option than the spam filter for those messages.

5:28 PM  
Anonymous Anonymous said...


All these people are missing the point. They're looking at the emails from the wrong point of view. It's not whether they're a customer of a bank or not. It's whether they would recognize that email as spam or not in a general sense. Don't look at it too personally.

5:49 PM  
Blogger JoeChongq said...

If you signed up for a service that includes them sending you email, how can that be considered spam? It may not be useful to you, but you signed up and can easily unsubscribe. Mixing emails that are not useful to you with unsolicited emails from companies that you have no relationship at all is why distributed spam filters don't work. You sign up to a newsletter, decide you don't want it anymore and start marking it spam.

I hate getting jokes and chain forwards in email, but they are not spam. The same goes for other messages in this corpus you might not recognize. Just because you don't deal with PayPal or doesn't mean that the emails in this study are necessarily spam. To the recipient they may not have been spam.

To do what John is attempting here (verify or improve the accuracy of the test corpus), we need to be realistic about what was spam and not spam for the original recipient. I know that is hard to do, I have often been unsure or realized later I marked a few wrong. There is a "I'm not sure" button for a reason. Just marking every commercial email as spam is not helping at all.

5:59 PM  
Anonymous Anonymous said...

You know, I only use my Hotmail mailbox for receiving receipts of things I order online. I don't have paypal, never had, Nor Have I EVER purchased tickets from an online supplier (i.e. travelocity) NOR have I ever rented a car, purchased a laptop online, or bought a car online. So to me, the 20+ admails I get are SPAM, because I haven't ever used their services, so there is absolutely no way they have a valid reason to send me those emails.

6:19 PM  
Anonymous Anonymous said...

I think you guys are all missing the point. I am constantly monitoring emails that are blocked by our spam filter for false positives and have no way to determine if they were solicited or not. I can only go by what I've seen spam look like before, and I have to say that I'm pretty accurate. I can tell a spam from a ham rather well because of a few tell-tale signs in the content and the header.

Although it is good practice to instruct your users to only consider solicited emails to be legit, I think a good administrator must be familiar with and have the ability to pick out phishing scams and spams in a heartbeat.

6:41 PM  
Anonymous Anonymous said...

Those e-mails ARE spam. Just because they're not phishing e-mails, that doesn't mean they're not spam. Spam is basically any "generic" email you receive without specifically asking for it.

7:07 PM  
Anonymous Ed Murphy said...

The anonymization of header lines (s/originalcompany/ENRON/) was left incomplete for at least one multi-line header (To: name1, name2).

8:08 PM  
Anonymous Ed Murphy said...

Also, um, "Trial Copy" is partly visible at the top of your screenshots. What program are you using to (I assume) emulate OE without actually running OE itself (warts and all), and do you intend to register it? :)

8:11 PM  
Anonymous KE said...

Does it count as spam if you signed yourself up for the newsletter, and don't enjoy it?
Some newsletters don't even have an unsubscribe feature. I would count THAT as spam.

8:18 PM  
Anonymous Anonymous said...

I just tried it, and I think that with a disagreement rate of 1 in 100 (as I got) it wouldn't hurt to ask for confirmation.
I clicked the wrong button once after doing more than 80, because I lost concentration. Otherwise, you might want to limit or throttle the amount done in one session, or add a small reward or attention grabbing device every - say - 25 mails. (Though the captchas may work for this purpose to some extent.)

9:28 PM  
Anonymous Anonymous said...

USAirways E-savers is an opt-in mailing list. It's not spam.

9:33 PM  
Anonymous Anonymous said...

I saw several in there classified as 'spam' that were clearly from opt-in lists. ESPN Football news of the day, etc.

I do love the fact the database is largely populated with Enron email though.

10:05 PM  
Anonymous Anonymous said...

That's nice and most of the examples are good. However, the header information is usually where I look to see if a message is Spam. (I run my own IMAP email server)
If your headers actually made sense instead of all originating from the same domain, this exercise might be useful.

10:25 PM  
Anonymous Anonymous said...

it is practically IMPOSSIBLE to tell phishing from legit email. With the ability to "spoof" URLs, someone can make a fake URL look real. The PayPal example on the site looked real if one was to just look at it. People need to start either examining ALL the HTML source and headers or we need a shift back to plain-text email.

10:46 PM  
Anonymous Wayne Stewart said...

One simple rule I use for Spam is checking the spelling or grammer. If there is an error - it's probably spam.

11:01 PM  
Anonymous jes5199 said...

I use the "this is spam" button as a punishment button.
You can't write in complete sentences? You can't make a clear point? SPAM! Ha! see how it feels.

11:48 PM  
Anonymous Nelo said...

Spam is generally defined as Unsolicited Commercial Email (UCE). As such, I would argue that the US airways and Travelocity emails are spam - it really depends on whether they were solicited.

12:04 AM  
Anonymous Mr Mc Chewy said...

Can all of the above responses be considered spam?
Think about it. ;)

1:23 AM  
Blogger The Mushroom said...

I marked the Travelocity messages as spam (I did over 200 msgs today) because it's junkmail, even if the recipient opted in to get it. Ticketmaster, Radio Shack, and Major League Baseball have the same systems, where if you participate in something on their site (buying tickets, voting) and somehow miss the "leave me alone" option you get your box filled. The filter agreed with me that it was crap, even if it was opt-in crap.

I did see one in my session today where the name and email address matched, but it was an advertisement for a local plumbing company. Just an example of a legit business resorting to advertising by email -- a bad choice, right up there with fax advertising but not illegal -- but it's unsolicited email nonetheless, ergo spam. End result would be that the business would soon not be able to email some existing client or others because their domain would be blacklisted. Spam kills, plumberboy.

1:26 AM  
Anonymous Anonymous said...

It's ALL Spam !
Unsolicited e-mail IS SPAM !

And who the heck wants advertising sent to them ?

When I need plane tickets I'll go shopping.
So Stick yer "e-mail advert special" where the sun don't shine...

3:19 AM  
Anonymous Ted Cooper said...

This will mostly get lost in the /.ing but hey..

I have to agree with you that people are terrible at identifying spams and scams, however this is simply because they haven't been educated in being able to identify them.

However, after looking at the site, the emails shown do not have enough information to be able to accurately tell if a message is ham or spam. The automated spam scanners have the benefit of all the headers and the raw format of the emails. Just looking at the default Microsuck Outlook display of email makes it exceedingly difficult.

If users are to be educated in identifying what's crap and what's not, they have to be given the tools needed and then told what to look for.

5:41 AM  
Anonymous Anonymous said...

Just use spamgourmet or bugmenot to avoid the stupid "solicited but unwanted as a result of a stupid signup form trick". I make certain all signups are completely unusable by the data miners. Then again they probably have "Spam" filters on those as well to root out fake info.

6:26 AM  
Anonymous Anonymous said...

Wayne Stewart said "grammer",
so he must be a spammer!

(according to his own definition)

6:50 AM  
Anonymous Anonymous said...

There's obviously a huge confusion over what exactly constitutes spam. Some people (including /.) even mix up spam and phishing.

For me, chain letters are spam even if they're not commercial. Any mailing list I didn't very explicitly subscribe to is spam. It doesn't matter if the spam comes from a legit company and is legal per some U-CAN-SPAM law, it's still spam. Ad-ridden "newsletters" sent by my ISP with no option to unsubscribe are spam.

7:00 AM  
Anonymous Anonymous said...

I went fishing the other day. Didn't catch a thing. Came home and ate spam.

8:34 AM  
Anonymous Anonymous said...

The spam filter at the company I used to work for would consistently let the pharmas and enhancers through...but would faithfully send EVERY email from my b*st*rd boss to the spam folder.

Meant I had to look at the spam filter every few hours to keep him from jumping down my throat because I hadn't responded to an email he'd sent.

12:52 PM  
Anonymous Anonymous said...

I'm adding to the quiet chorus of people saying "That's not how I read my email". A great deal of information about whether something is legitimate or not is available in the most recent Received: header lines, and working with those associatively with the body and subject of the email is something that traditional anti-spam tools really haven't taken up. It's also something that's really hard for botnets to forge or work around.

Consider the following:

Received: from -1208043408 ( [])
by (Postfix) with SMTP id AB94134F9E
for <[email protected]>; Mon, 15 May 2006 18:43:28 -0500 (CDT)

That's EASILY identified as being Not The Right Source for most emails that might otherwise look like good phishing mail; it's not a reasonable source for any email purporting to be from a bank, Amazon, Paypal or eBay, or just about any other recognizable business. Additionally, that particular recipient address tends to get correspondence about a very limited subject set, and can weight any further analysis by a very different set of criteria than other emails ending up in the same inbox.
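One way to sketch that check: take the most recent Received: line (the one added by my own server), pull out the relay's reverse-DNS name, and see whether it belongs to the domain in the From: header. The regular expression, the function name and the sample hosts are assumptions for illustration. Real Received: lines vary widely between mail servers, and plenty of legitimate senders relay through third parties, so a mismatch is a signal rather than a verdict.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Matches "from HELO (rdns-name [ip])"; rdns-name and ip may be empty.
RECEIVED_FROM = re.compile(r"from\s+(\S+)\s+\((\S*)\s*\[([^\]]*)\]\)")

def relay_matches_sender(raw_message):
    """True if the newest relay's reverse-DNS name sits under the From: domain."""
    msg = message_from_string(raw_message)
    received = msg.get_all("Received") or []
    if not received:
        return False
    # The first Received: header is the most recent hop; unfold it first.
    match = RECEIVED_FROM.search(" ".join(received[0].split()))
    if not match:
        return False
    _helo, rdns, _ip = match.groups()
    sender_domain = parseaddr(msg.get("From", ""))[1].rpartition("@")[2].lower()
    rdns = rdns.lower()
    return bool(sender_domain) and (
        rdns == sender_domain or rdns.endswith("." + sender_domain)
    )

phish = (
    "Received: from -1208043408 (dsl-host.example.net [192.0.2.10])\n"
    "\tby mx.example.org (Postfix) with SMTP id AB94134F9E;\n"
    "\tMon, 15 May 2006 18:43:28 -0500 (CDT)\n"
    "From: PayPal <service@paypal.com>\n"
    "Subject: Verify your account\n"
    "\n"
    "Please confirm your details.\n"
)
print(relay_matches_sender(phish))
```

The HELO name is ignored on purpose, since the sending machine chooses it, while the name in parentheses is what the receiving server looked up for itself.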

3:05 PM  
Anonymous Nick B. said...

I worked for an Internet Service Provider for the better part of 4 years. We frequently sent our customers information about our service via e-mail. Those messages, along with other mail they had solicited themselves, were being blocked, deleted unread, or otherwise prompted a call to Technical Support asking why they were appearing in their inbox.

Solicited e-mail is NOT spam. If you have a USAirways esaver subscription, how ELSE are you going to be "automatically updated" on special priced flights?

If you signed up to receive a newsletter, or some other type of e-mail update... then it is NOT spam.

Readers of this website, and end-users in general need to pay attention to what they are receiving. If you choose to not want to read anything containing the word "viagra" or "breast enhancement" then set up your crafty little filter. But when my company (of which you are a customer) needs to contact you via e-mail... please, read the darn thing!

Oh.. and just to offer my own simple clarification to the kiddies out there.. Spam is a message that's trying to sell you something.. Phishing is a message asking you for information that could lead to them stealing from you. It's not really hard to tell the difference.

"Viagra.. $39.95" = SPAM

"Click here, and enter your credit card information to receive a free gift." = PHISHING

3:58 PM  
Anonymous Anonymous said...

Please insert a link to right under the code for your image (replacing etc with the same values in of course). That way those of us who are blind can participate. It still works now, it's just that I have to click view source to get your random string and construct the link myself.

6:48 PM  
Anonymous Anonymous said...

NickB, I don't care where you work, you're wrong.

You say spam is email trying to sell you something. That's not true. Probably most email trying to sell you something is spam, statistically, but not all email trying to sell you something is spam, and not all spam tries to sell you something.

Spam is UBE - Unsolicited Bulk Email.

If I sign up for a mailing list and it has advertisements trying to sell me something - that doesn't make it spam.

If you send 9 million random people ccs of your latest rant on whatever subject, you're not trying to sell anything, but it's still spam.

The notion that it's possible for anyone to determine whether or not the emails listed are spam, without context, without headers, without any other information than is shown, is ludicrous. Of course people are bad at this task - it's an impossible task.

7:45 PM  
Anonymous Anonymous said...

The captcha's timeout is way too low. It is a real punishment when you're trying to help categorize the spam/ham messages. After a bunch of messages you need to type in a captcha again. What a punishment!

8:52 AM  
Anonymous Andy Canfield said...

I have recently converted my Evolution system to "white box", so each message has four possible routes:
[1] rejected as SPAM by POP3, according to rules maintained by the ISP.
[2] rejected as SPAM by Evolution, according to messages I've marked in the past as spam.
[3] unsure, put in INBOX
[4] For sure not SPAM, according to rules I have crafted, put in WHITEBOX.
Works pretty good. The rule that Evolution does not support, which I want, is "If the sender is in my address book, it's not spam." That is why SPAM filtering must be done at the client; because the ISP can not know what is in my address book.
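The routing above, plus the missing address-book rule, might look like this as a sketch. It is not how Evolution works internally. The function name, the score threshold and the decision to let the address book override the client-side spam score are all assumptions.

```python
from email.utils import parseaddr

def route_message(sender_header, isp_flagged, client_spam_score,
                  address_book, whitelist_rules=(), spam_threshold=0.9):
    """Pick one of the four routes for a message.

    address_book holds lower-case addresses; whitelist_rules is an optional
    sequence of callables that take an address and return True for
    "for sure not spam".
    """
    # [1] Rejected upstream by the ISP: the client never gets a say.
    if isp_flagged:
        return "REJECTED_BY_ISP"

    address = parseaddr(sender_header)[1].lower()

    # [4] Known sender or hand-crafted rule: for sure not spam.
    if address in address_book or any(rule(address) for rule in whitelist_rules):
        return "WHITEBOX"

    # [2] Looks like mail previously marked as spam.
    if client_spam_score >= spam_threshold:
        return "REJECTED_BY_CLIENT"

    # [3] Unsure.
    return "INBOX"

book = {"alice@example.org"}
print(route_message("Alice <Alice@Example.org>", False, 0.99, book))
print(route_message("bob@example.net", False, 0.20, book))
```

Only step [1] can run at the ISP. Everything that consults `address_book` has to run where the address book lives, which is the point about client-side filtering.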

1:10 AM  
Anonymous Mark said...

Indeed whitelisting is the best method to get past spam for me. A list of valid e-mails are put in my mail folders for them ([email protected] -> list etc) .. known SPAM either %spamhigh or deleted (SA scores >6). The rest reach my INBOX. So far my spam has been controllable with this method.

I get 1-2 e-mails a day in my INBOX of which most are SPAM that has leaked through.

3:01 PM  
Anonymous Anonymous said...

I agree with some of the comments here that advising on spam without context is a futile effort. When I read my mails, I know what companies I've had previous contacts with and am able to act accordingly when I see mails from them. But to someone else without that knowledge, a fair amount of the mails I get from companies I've had contact with would be classified as spam. Without previous knowledge about the intended recipient of the messages and his or her interactions with companies and individuals beforehand, it's impossible to tell which is spam and ham.

11:06 AM  
Anonymous Anonymous said...

Of 35 examples I got I missed one, a spam and pushed the wrong button for a ham.

Ever since I swept the net for all e-mail addresses and published them in wildcard-searchable format in Feb 2002, I had a major problem with spam, until in 2003 I found out that if I used the included URLs and 0800 numbers, I could reduce spam substantially.

Last year I went from 150 spams a day to 250 in two months, but of those only 1-2 spams passed my URL/phone filter. One reason was that at that time I implemented lists instead of maintaining my own, and put all unidentified mails in a loop for 20 minutes before testing them again against a "real-time" list published by SURBL.

Unfortunately my ISP couldn't handle the load, but they installed SpamAssassin with SURBL support, and then I made a procmail filter mimicking some of my filter, so I still have no more than 1-2 spams coming in per day.

Spam has one difference from virus mail and ordinary mail: it wants to "sell" something, i.e. the sender needs to get the target to contact them. A virus maker doesn't want contact, just to execute their code, and normal mail does not want to sell, and that is the hook to use.
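A small sketch of that hook: pull the contact points (URL hosts and 0800 numbers) out of the body and test them against lists of known spam contacts. Here the lists are plain in-memory sets standing in for a service such as SURBL, which is actually queried over DNS. The regular expressions, names and sample data are assumptions for illustration.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
PHONE_RE = re.compile(r"\b0800[-\s]?\d{3,}[-\s]?\d*")

def _host(url):
    try:
        return urlparse(url).hostname
    except ValueError:  # malformed URL, e.g. a broken IPv6 literal
        return None

def contact_points(body):
    """Hosts and (digits-only) phone numbers the reader is asked to contact."""
    hosts = {_host(u) for u in URL_RE.findall(body)}
    hosts.discard(None)
    phones = {re.sub(r"\D", "", p) for p in PHONE_RE.findall(body)}
    return hosts, phones

def host_is_blocked(host, blocked_domains):
    # Check the host and each parent domain, so shop.pills.example
    # is caught by a listing for pills.example.
    parts = host.split(".")
    return any(".".join(parts[i:]) in blocked_domains for i in range(len(parts)))

def is_listed(body, blocked_domains, blocked_phones):
    hosts, phones = contact_points(body)
    return (any(host_is_blocked(h, blocked_domains) for h in hosts)
            or bool(phones & blocked_phones))

body = "Buy now at http://shop.pills.example/offer?id=1 or call 0800-123 4567"
print(is_listed(body, {"pills.example"}, set()))
```

Mail that matches nothing could be held and re-tested later, as in the 20-minute loop described above, to give the lists time to catch up with a new campaign.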

1:03 PM  
