Friday, March 19, 2010

Responding to jacquesm's challenge

The other day on Hacker News a user posted an anonymous comment. Regular Hacker News participant jacquesm wanted to know who had written it and posted a challenge to unmask the user.

He also emailed me because he thought I might be the anonymous user. I agreed to help him with a little bit of text mining. Jacques had a nice database of all Hacker News submissions and comments and gave me a 250MB SQL file suitable for loading into MySQL. Unfortunately, Jacques' database only had comments up until September 2009, so if the user who wrote the anonymous comment joined recently it won't be possible to identify them.

I quickly whipped up a naive Bayesian text classifier in Perl (similar to this one that I wrote about in Dr Dobbs five years ago) with one category per Hacker News user. I took the text of the anonymous comment and fed it to the classifier. The classifier used the complete set of comments for each user to train each category and then scored the anonymous comment against those categories.
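For the curious, here's a minimal sketch of that kind of per-user classifier. It's in Python rather than the Perl I actually used, and the comments_by_user structure (a dict mapping usernames to lists of their comment texts) is invented for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train(comments_by_user):
    """One category per user: count word occurrences across all their comments."""
    counts = {user: Counter(w for comment in comments for w in tokenize(comment))
              for user, comments in comments_by_user.items()}
    vocab = set(w for c in counts.values() for w in c)
    return counts, vocab

def rank_users(document, counts, vocab):
    """Rank users by the (Laplace-smoothed) log-likelihood of the document."""
    words = tokenize(document)
    scores = {}
    for user, c in counts.items():
        total = sum(c.values())
        scores[user] = sum(math.log((c[w] + 1) / (total + len(vocab))) for w in words)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. rank_users(anonymous_comment, *train(comments_by_user))[:10]
```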

Culling the list to users who have commented recently, the 10 most likely users (according to the text classification) are, in order of likelihood: sh1mmer, stcredzero, gojomo, patio11, andreyf, anigbrowl, teej, physcab, thorax, run4yourlives.

Now, none of those users actually commented on the thread in question. Assuming it was someone who was commenting on the thread and then switched to a different account, the most likely candidates are (again in order): petercooper and jacquesm.

So did this approach work?

PS: As a quick test of my classifier, I ran it against one of jacquesm's own comments and got the following people in order of likelihood: jacquesm, geoscripting, kunqiana, jerf, mixmax.


Friday, February 12, 2010

So you think machine learning is boring...

Here's something I wrote for the company blog.

If you say the words 'machine learning' to people, they either look confused or bored. Since the promise of Artificial Intelligence evaporated in the 1970s, machine intelligence seems to be one of those things that's perpetually 20 years away.

But computer industry insiders know that many forms of machine learning are at work all the time. The most common and visible are recommendation systems like the one on Amazon.com that comes up with suggestions for other books you might like. But even that doesn't express the true power of state-of-the-art algorithms.

But a helicopter doing backflips and a walking robot do.

Read the rest.


Monday, October 05, 2009

Finally out of stealth, here's Causata, Inc.

Wow, it's been months and months that we've been working quietly to build our software platform, recruit more team members, design the UI, do user testing, ... But, finally, Causata is here.



I won't be blogging about Causata stuff here because there's a corporate blog where you can read my first blog post about working at Causata.

My biggest concern right now is recruiting. If you know a lot about Java or JavaScript and are based in London or Silicon Valley, please get in contact.


Thursday, January 17, 2008

Another use of POPFile: detecting weakly encrypted email

Almost all users use POPFile as a spam filter, and most of them also take advantage of the fact that POPFile can sort mail into arbitrary categories. However, some people have pushed POPFile even further... Martin Overton (of IBM) has used POPFile to discover email-borne malware, even finding that POPFile could automatically detect mutations. Now, some researchers in Japan have used POPFile to detect weak encryption of email with 80% accuracy.

The researchers were building a system to detect improper sending of personal information by email. Their system first checked for the use of strong encryption (if the mail is strongly encrypted then there's no need to worry about eavesdropping); it also checked for things like telephone numbers, email addresses and other personal data in non-encrypted mail.

But they also wanted a system to detect poor encryption (such as ROT-13), and for that they turned to POPFile. After training on a mere 30 emails, POPFile was able to distinguish plain text messages from those encrypted with weak ciphers.
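To make the idea concrete, here's a rough sketch of why a Bayesian text classifier learns this distinction so quickly. It uses character trigrams rather than POPFile's word-level features, and the training texts are invented, so treat it as an illustration of the principle rather than the researchers' actual setup.

```python
import codecs
import math
from collections import Counter

def trigrams(text):
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]

def train(examples):
    """examples: (label, text) pairs with label 'plain' or 'weak'."""
    counts = {'plain': Counter(), 'weak': Counter()}
    for label, text in examples:
        counts[label].update(trigrams(text))
    return counts

def classify(text, counts):
    """Pick the category with the higher Laplace-smoothed log-likelihood."""
    vocab = set(counts['plain']) | set(counts['weak'])
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores[label] = sum(math.log((c[g] + 1) / (total + len(vocab)))
                            for g in trigrams(text))
    return max(scores, key=scores.get)

# Tiny, made-up demonstration: ROT-13 text has very different trigram
# statistics from readable English, so even one example of each goes a long way.
plain = "please find the quarterly report attached to this message"
model = train([('plain', plain), ('weak', codecs.encode(plain, 'rot_13'))])
print(classify("meeting moved to thursday afternoon", model))                 # expect 'plain'
print(classify(codecs.encode("meeting moved to thursday", 'rot_13'), model))  # expect 'weak'
```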

Some details in their paper.


Wednesday, March 28, 2007

So how much difference do stopwords make in a Naive Bayes text classifier?

POPFile contains a list of stopwords (words that are completely ignored for the purpose of classifying messages) that I put together by hand years ago (if you are a POPFile user they are on the Advanced tab and called Ignored Words). The basic idea is that stopwords are useless from the perspective of classification and should be ignored; they are just too common to provide much information.

My commercial machine learning tools did not, until recently, have a stopwords facility. This was based on my belief that stopwords didn't make much difference: if they are common words they'll appear everywhere, and their probabilities will be roughly equal for each category of classification. I had a 'why bother' attitude.

Finally, I got around to testing this assumption, taking the standard 20 Newsgroups test set and using a (complement) Naive Bayes text classifier as described here. And I can give some actual numbers.

The bottom line is that removing stopwords did improve classification accuracy for 20 Newsgroups (and for other test datasets that I have), but only by 0.3%. The stopword list used was the top 100 most frequently occurring English words.
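If you want to reproduce the flavour of this experiment, here's a minimal sketch using scikit-learn rather than my own tools, so the exact numbers will differ; the stopword list is built from the 100 most frequent tokens in the training data as a stand-in for a fixed list of common English words.

```python
from collections import Counter
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB
from sklearn.metrics import accuracy_score

train_set = fetch_20newsgroups(subset='train')
test_set = fetch_20newsgroups(subset='test')

def accuracy(stop_words=None):
    # Bag-of-words counts, optionally dropping the stopword list, then
    # complement Naive Bayes.
    vec = CountVectorizer(stop_words=stop_words)
    clf = ComplementNB().fit(vec.fit_transform(train_set.data), train_set.target)
    return accuracy_score(test_set.target, clf.predict(vec.transform(test_set.data)))

# Stopword list: the 100 most frequent tokens in the training data.
analyzer = CountVectorizer().build_analyzer()
token_counts = Counter(t for doc in train_set.data for t in analyzer(doc))
stopwords = [t for t, _ in token_counts.most_common(100)]

print("no stopword list:  ", accuracy())
print("with stopword list:", accuracy(stop_words=stopwords))
```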

So, I was wrong... they do make a difference, but not by much.


Thursday, March 15, 2007

Calibrating a machine learning-based spam filter

I've been reading up about calibration of text classifiers, and I recommend a few papers to get you started:

The overall idea is that the scores output by a classifier need to be calibrated so that they can be understood. Specifically, if you want to interpret them as the probability that a document falls into a particular category, then calibration gives you a way to estimate that probability from the scores.

The simplest technique is bucketing or binning. Suppose that the classifier outputs a score s(x) in the range [0,1) for an input document x. Once a classifier is trained, it's possible to calibrate it by classifying known documents, recording the output scores, and then, for each of a set of ranges (e.g. divide [0,1) into 10 ranges [0.0,0.1), [0.1,0.2) and so on), counting the number of documents in each class to get a fraction that represents the probability within that range. Then, when you classify an unknown document, you look at the range in which its score falls to get a probability estimate.

For spam filtering, each range would record the number of emails in the range that are ham and the number that are spam. The ham probability then just comes out to (# of ham emails in the range) / (total number of emails in the range).
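Here's a minimal sketch of that binning step, assuming a classifier that exposes a score(doc) method returning a value in [0,1) and a labelled calibration set of (document, label) pairs; both names are illustrative, not any particular library's API.

```python
def calibrate(classifier, calibration_set, n_bins=10):
    """For each 0.1-wide score range, estimate P(ham) from labelled data."""
    bins = [{'ham': 0, 'spam': 0} for _ in range(n_bins)]
    for doc, label in calibration_set:
        b = min(int(classifier.score(doc) * n_bins), n_bins - 1)
        bins[b][label] += 1
    # None marks a bin with no calibration data at all.
    return [b['ham'] / (b['ham'] + b['spam']) if (b['ham'] + b['spam']) else None
            for b in bins]

def p_ham(classifier, doc, calibrated, n_bins=10):
    """Look up the calibrated ham probability for an unseen document."""
    b = min(int(classifier.score(doc) * n_bins), n_bins - 1)
    return calibrated[b]
```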

I decided to run a little test of this and took the complete TREC 2005 Public Spam Corpus and trained a Gary Robinson style spam filter on it. Then I classified the entire corpus to get calibration data. Instead of looking at the final scores, I looked at the pair of scores (one for ham, one for spam) that the classifier generates and used those to generate bins. The following chart shows (crudely) the percentage of spams and hams in each bin:



The x-axis is the ham score, with each square representing a single 0.1-wide bin (bin start values from 0.0 to 0.9). The left-most square means that the classifier gave a very low score in terms of hamminess; the right-most square means that it gave a very high one.

The y-axis is the spam score (again bin start values from 0.0 to 0.9), so the bottom means a low spamminess score and the top a high one. Each square is colored red and green: red is the percentage of messages in that bin that are spam, and green the percentage that are ham.

So, as you'd expect, the top-left corner (low ham score, high spam score) is generally colored red, indicating that these are spam messages, and the bottom-right corner (low spam score, high ham score) is colored green (lots of ham). Some squares are black because there's too little data (I arbitrarily said that if there were fewer than 5 messages in a bin it was colored black).

But there's a really interesting square in the top right: it's almost all red (in fact, it's 93% red or spam). So we can have confidence that a message there (which corresponds to a final 'score' of 0.5) is in fact spam.

I'm sure that there are other interesting insights to be gained from calibration, and I'd like to spend the time to evaluate the effectiveness of using the probabilities generated in this way (instead of the rather arbitrary selection of a score 'cutoff' point) to determine whether a message is spam, ham or undetermined. For example, the square at (0.9,0.2) (which corresponds to a 'score' of 0.18) is 57% ham and 43% spam, so it looks like a good candidate for undetermined; it looks like a much better candidate than the typical "area around 0.5" (which corresponds to (0.9,0.9) and is 93% spam).
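For completeness, here's a sketch of the two-dimensional version used to build the chart: bin the (ham score, spam score) pairs, compute the spam fraction per cell, and black out (here: return None for) any cell with fewer than 5 messages. Again the names and data layout are illustrative.

```python
def calibrate_2d(scored_messages, n_bins=10, min_count=5):
    """scored_messages: iterable of (ham_score, spam_score, label) tuples,
    with scores in [0, 1) and label 'ham' or 'spam'. Returns a grid of
    spam fractions indexed [ham_bin][spam_bin], None where data is sparse."""
    cells = [[{'ham': 0, 'spam': 0} for _ in range(n_bins)] for _ in range(n_bins)]
    for ham_s, spam_s, label in scored_messages:
        i = min(int(ham_s * n_bins), n_bins - 1)
        j = min(int(spam_s * n_bins), n_bins - 1)
        cells[i][j][label] += 1
    spam_fraction = [[None] * n_bins for _ in range(n_bins)]
    for i in range(n_bins):
        for j in range(n_bins):
            total = cells[i][j]['ham'] + cells[i][j]['spam']
            if total >= min_count:
                spam_fraction[i][j] = cells[i][j]['spam'] / total
    return spam_fraction
```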
