Monday, December 14, 2009

An open source project for my Met Office data analyzer

Since some other people have been playing with my little Perl program to analyze the Met Office land surface temperature data, I've registered a project at SourceForge so that others can work with me on it.

I've also imported my latest version of the script which outputs data about the number of stations used to create the gridding data, and does cosine weighting of the northern and southern hemisphere trend data.
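For anyone wondering what the cosine weighting does: grid cells cover less surface area the closer they are to the poles, so each cell's anomaly is weighted by the cosine of its latitude before averaging. Here's a minimal sketch of the idea (illustrative only, not the project's actual code, which reads the Met Office grid files):

```perl
use strict;
use warnings;
use Math::Trig qw(deg2rad);

# Average a set of gridded anomalies, weighting each grid cell by
# cos(latitude) so that the smaller polar cells don't get the same
# influence as the larger equatorial ones.
sub cosine_weighted_mean {
    my (@cells) = @_;    # each cell: [ latitude_degrees, anomaly ]
    my ( $sum, $weight ) = ( 0, 0 );
    for my $cell (@cells) {
        my ( $lat, $anomaly ) = @$cell;
        my $w = cos( deg2rad($lat) );
        $sum    += $w * $anomaly;
        $weight += $w;
    }
    return $weight ? $sum / $weight : undef;
}
```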

It can all be found at Land Surface Temperature Analyzer.


Monday, January 07, 2008

First release of my 'shimmer' project

A couple of months ago I blogged about a system for opening and closing ports based on a cryptographic algorithm that makes it hard for an attacker to guess the right port. It's a sort of port knocking scheme that I called C3PO.

Many commentators, via email, on the blog and in other forums, requested that I open source the code. I couldn't do that because the code was a nasty hack put together for my machine, but I've gone one better.

Today I'm releasing the first version of shimmer. shimmer is completely new, GPL-licensed code, implementing my original idea. Read more about it on the site.

Hit the right port and you're in; hit the wrong one and you're blacklisted. Ports change every 60 seconds.
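The released code is the reference for how shimmer actually does it, but the gist of deriving a moving port from a shared secret and the clock can be sketched like this (the function, constants and windowing here are illustrative, not shimmer's actual algorithm):

```perl
use strict;
use warnings;
use Digest::SHA qw(hmac_sha256);

# Derive the port that is open during the current 60-second window
# from a shared secret.  Client and server compute the same value, so
# only a holder of the secret knows which port to knock on; anyone
# else is guessing among tens of thousands of ports.
sub current_port {
    my ( $secret, $time ) = @_;
    my $window = int( $time / 60 );          # changes every minute
    my $digest = hmac_sha256( $window, $secret );
    my $n      = unpack( 'N', $digest );     # first 32 bits of the HMAC
    return 1024 + ( $n % 64511 );            # a port in 1024..65534
}
```

Both sides call `current_port` with the shared secret and their clock; as long as the clocks agree to within the window, they agree on the port.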


Tuesday, July 03, 2007

Please vote for POPFile

POPFile has been nominated for a SourceForge Community Choice Award through the efforts of its users.

Now it's time to vote.

If you love POPFile, please vote for it in the Best Project for Communications category.


Monday, June 18, 2007

POPFile v0.22.5 Released

Here are the details:

Welcome to POPFile v0.22.5

This version is a bug fix and minor feature release that's been over a
year in the making (mostly due to me being preoccupied by other
projects).


SourceForge has announced their 'Community Choice Awards' for 2007 and
is looking for nominations. If you feel that POPFile deserves such an
honour please visit the following link and nominate POPFile in the
'Best Project for Communications' category. POPFile requires multiple
nominations (i.e. as many people as possible) to get into the list of
finalists.



1. POPFile now defaults to using SQLite2 (the Windows installer will
convert existing installations to use SQLite2).

2. Various improvements to the handling of Japanese messages and to
the 'Nihongo' environment:

Performance enhancement for converting character encodings by
skipping conversions that are not needed. Fix a bug where the
wrong character set was sometimes used when the charset was not
defined in the mail header. Fix a bug where several HTML
entities caused misclassification. Avoid an 'uninitialized
value' warning. Fix a bug where the word links in the bucket's
detail page were not URL-encoded.

3. Single Message View now has a link to 'download' the message from
the message history folder.

4. Log file now indicates when an SSL connection is being made to the
mail server.

5. A number of small bug fixes to the POPFile IMAP interface.

6. Installer improvements:

Email reconfiguration no longer assumes Outlook Express is
present. Add/Remove Programs entries for POPFile now show a more
realistic estimate of the program and data size. Better support
for proxies when downloading the SSL Support files. The SSL
patches are no longer 'hard-coded', they are downloaded at
install time. This will make it easier to respond to future
changes to the SSL Support files. The Message Capture Utility
now has a Start Menu shortcut to make it easier to use the
utility. The minimal Perl has been updated to the latest
version. The installer package has been updated to make it work
better on Windows Vista (but further improvements are still needed).



An introduction to installing and using POPFile can be found in the
QuickStart guide:


SSL Support is offered as one of the optional components by the
installer. If the SSL Support option is selected the installer will
download the necessary files during installation.

If SSL support is not selected when installing (or upgrading) POPFile
or if the installer was unable to download all of the SSL files then
the command

setup.exe /SSL

can be used to run the installer again in a special mode which will
only add SSL support to an existing installation.


The current version of SQLite (v3.x) is not compatible with POPFile.
You must use DBD::SQLite2 to access the database.

Users of SSL on non-Windows platforms should NOT use IO::Socket::SSL
v0.97 or v0.99. They are known to be incompatible with POPFile; v1.07
is the most recent release of IO::Socket::SSL that works correctly.


If you are upgrading from pre-v0.22.0 please read the v0.22.0 release
notes for much more information:


Thank you to everyone who has clicked the Donate! button and donated
their hard earned cash to me in support of POPFile. Thank you also to
the people who have contributed their time through patches, feature
requests, bug reports, user support and translations.


Big thanks to all who've contributed to POPFile over the last year.



Monday, June 11, 2007

Measuring my inbox depth

Some time ago I wrote about how I manage the flow of email hitting me. Back then I didn't talk about what happens to the email after it's been filtered and automatically sorted.

Here's a quick picture of my email folders:

POPFile does the automatic sorting into the sub-folders when I download mail (read the article linked above for details on that) and I then manually move mail that I can't respond to immediately to the ACTION folder (to make this quick I use the lovely TB QuickMove Extension which makes all my mail moving a CTRL-# click away).

Anything that's in ACTION needs to be dealt with. Sometimes that means those messages will be gone in a day, sometimes in weeks, just depends on the contents. There are also a couple of other 'action' type folders called Next Newsletter (where I store interesting stuff to mention in my next spam and anti-spam newsletter) and Next GNU Make (where I store stuff that'll go into next month's CM Basics Mr Make column).

It occurred to me that the depth of my inbox might be an interesting thing to track so I wrote a quick Perl script that parses the mbox file associated with the ACTION folder looking for non-deleted messages (I look at the X-Mozilla-Status header and check that the 0x8 bit is not set) to get a count of the messages in ACTION.
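The counting amounts to scanning the mbox for X-Mozilla-Status headers and skipping any message whose Expunged bit (0x8) is set. A minimal version of that core logic (the details of my actual script differ):

```perl
use strict;
use warnings;

# Count non-deleted messages in a Thunderbird mbox file by finding
# each X-Mozilla-Status header and checking that the Expunged bit
# (0x0008) is not set.
sub count_action_items {
    my ($mbox) = @_;
    open my $fh, '<', $mbox or die "Cannot open $mbox: $!";
    my $count = 0;
    while ( my $line = <$fh> ) {
        if ( $line =~ /^X-Mozilla-Status:\s*([0-9a-fA-F]{4})/ ) {
            $count++ unless hex($1) & 0x8;
        }
    }
    close $fh;
    return $count;
}
```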

Then I pump that data into rrdtool using the RRD::Simple interface from CPAN and use it to create graphs of the number of waiting items. All this runs off a cron job every five minutes (and once an hour for graph creation).

You can see the 'live' graphs here (these will update hourly when my machine is on):


Tuesday, May 15, 2007

Some architectural details of Signal Spam

Finally, Signal Spam, France's new national anti-spam system, launched and I'm able to talk about it. For a brief introduction in English start here.

I'm not responsible for the idea behind Signal Spam, nor for its organization, but I did write almost all the code used to run the site and the back end system. This blog post talks a little bit about the design of Signal Spam.

Signal Spam lets people report spams via either a web form or a plug-in. Plug-ins are currently available for Outlook 2003, Outlook 2007 and Thunderbird 2.0, with more coming. Currently Signal Spam does three things with every message: it keeps a copy in a database after extracting information from the body and headers of the message; it figures out whether the message came from an ISP in France and, if so, sends an automatic message to that ISP indicating that they've got a spammer or zombie on their network; and it figures out whether the message was actually a legitimate e-marketing message from a French mailer and, if so, tells the person reporting the spam how to unsubscribe.

The original plan was that the system be capable of handling 1,000,000 messages per day allowing for peaks of up to 1000 messages per minute (such as when people first come to work in the morning) and that messages would be handled in near real-time (i.e. the time from a message being received by the system to it being analyzed and forwarded to an ISP would be under 60 seconds). Signal Spam also wanted a lot of flexibility in being able to scale the system up as use of the site grew and being able to do maintenance of the site without taking it down.

Here's the last 12 hours of activity on the site, which pretty much matches what we expected with a peak once people get to work and start reading their mail. (These charts are produced automatically by the lovely RRDTool.)

The system I built is split into two parts: the front end (everything that the general public sees including the API used by the plug-ins) and the back end (the actual database storing the messages sent, the software that does analysis and the administration interface). Communication between the front end and the back end uses a web service running over HTTPS.

To make things scale easily the back end is entirely organized around a simple queuing system. When a message arrives from a user it is immediately stored in the database (there are, in fact, two tables: one table contains the actual complete message as a BLOB and the other contains the fields extracted from the message. The messages have the same ID in each table and the non-BLOB table is used for searching and sorting).

Once stored in the database the message ID is added to a FIFO queue (which is actually implemented as a database table). An arbitrary number of processes handle message analysis by dequeuing IDs from the FIFO queue (using row-level locking so that only one process gets each ID). Once dequeued the message is analyzed: fields such as From, Subject and Date are extracted and stored in the database; the Received headers are walked, using a combination of blacklist lookups and forgery detection, to find the true IP address that injected the message into the Internet; the IP address is mapped to the network that manages it; fingerprints of the message are taken; and all URLs inside the message are extracted.
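The dequeue step can be sketched with DBI. The table and column names below are made up for illustration, not Signal Spam's actual schema, but the pattern is the one described: a transaction plus SELECT ... FOR UPDATE so that concurrent analyzers each claim a different ID:

```perl
use strict;
use warnings;

# Claim the next message ID from the analysis queue.  With InnoDB,
# SELECT ... FOR UPDATE takes a row-level lock, so two analyzer
# processes running this at the same time cannot claim the same row.
# Returns the claimed ID, or undef when the queue is empty.
sub dequeue_id {
    my ($dbh) = @_;
    $dbh->begin_work;
    my ($id) = $dbh->selectrow_array(
        'SELECT id FROM analysis_queue ORDER BY id LIMIT 1 FOR UPDATE');
    if ( defined $id ) {
        $dbh->do( 'DELETE FROM analysis_queue WHERE id = ?', undef, $id );
    }
    $dbh->commit;
    return $id;
}
```

Because the lock lives in the database, adding more analyzer processes (on the same machine or another) needs no coordination beyond a database connection.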

Once the analysis is complete the process decides whether the message needs to be sent to an ISP. If so it enqueues the message ID on another FIFO queue for a separate forwarding process to handle. If the message is in fact a legitimate message then the message ID is enqueued on a FIFO queue for another response process to handle.

For every ID it dequeues, the forwarding process generates an ARF message and sends it to the appropriate ISP, using VERP for bounce and response handling.
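VERP simply encodes the message ID into the envelope sender, so a bounce tells you exactly which forwarded report failed. A sketch (the mailbox name and domain are illustrative, not Signal Spam's):

```perl
use strict;
use warnings;

# Build a VERP envelope sender embedding the message ID, and recover
# the ID from the recipient address of a bounce.
sub verp_sender {
    my ($id) = @_;
    return "bounces+$id\@example.org";
}

sub id_from_bounce {
    my ($address) = @_;
    return $address =~ /^bounces\+(\d+)\@/ ? $1 : undef;
}
```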

The response process dequeues IDs and responds to the original reporter of the spam with a message tailored to the specific e-marketer, including full unsubscribe details.

The use of queues, with a shared database to handle them, plus a simple locking strategy, means that arbitrary numbers of processes can be added to handle the load on the system as required (currently there is only one process of each type running, and it handles all messages within the required delay). It also means that the processes do not need to be on the same machine, so the system can scale by adding processes or adding hardware.

Stopping the processes does not stop the front end from operating. Messages will still be added to the database and the analysis queue will grow. In fact, the length of the queue makes measuring the health of the system trivial: just look at the length of the queue to see if we are keeping up or not.

Since the queue holds the knowledge about the work to be done, processes can be stopped and upgraded as needed without taking the system offline.

To hide all this the entire system (which is written in Perl---in fact, the back end is entirely LAMP) uses an object structure. For example, creating the Message object (passing the raw message into the constructor) performs the initial message analysis and queues the message for further analysis. Access to the database queues is entirely wrapped in a Queue object (constructor takes the queue name). These objects are dynamically loaded by Perl and can be upgraded as needed.

Finally, all the objects (and related scripts) have unit tests using Perl's Test::Class, and the entire system can be tested with a quick 'make test'. One complexity is that most of the classes require access to the database. To work around this I created a Test::Database class that can set up a complete MySQL instance from scratch (assuming MySQL is installed) and load the right schema, totally independent of any other MySQL instance. The class then returns a DBI handle to that instance plus a connect string. This means the unit tests do not depend on an existing running database.

In addition, the unit tests include a system that I created for POPFile which allows me to get line-by-line coverage information showing what's tested and what's not. By running 'make coverage' it's possible to run the test suite with coverage information. This gives percentage of lines tested and for every Perl module, class and script a corresponding HTML file is generated with lines colored green (if they were executed during testing) or red (if not). The coloring is achieved by hooking the Perl debugger (see this module from POPFile for details).

Here's an example of what that looks like (here I'm showing Classifier/Bayes.html which corresponds to Classifier/ in the POPFile code, but it looks the same in Signal Spam):

The green lines (with line numbers added) were executed by the test suite; the red line was not (you can see here that my test suite for POPFile didn't test the possibility that the database connect would fail).


Wednesday, March 28, 2007

Code to decode an a.b.c.d/n IP address range to the list of IP addresses

I needed to map some addresses in the standard IP address prefix syntax to the actual list of addresses, and after I'd searched Google for all of 11 usecs for a suitable on-line decoder I hacked one up in Perl.

Since someone else might find this handy, here it is:

use strict;
use warnings;

my $address = $ARGV[0] || '';

my ( $ip, $bits ) = split( /\//, $address );
my @octets = defined( $ip ) ? split( /\./, $ip ) : ();

if ( !defined( $bits )    ||
     ( $bits !~ /^\d+$/ ) ||
     ( $bits > 32 )       ||
     ( $#octets != 3 ) ) {
    die "Usage: ip-decode a.b.c.d/x\n";
}

my $base = ( ( $octets[0] * 256 + $octets[1] ) * 256 +
    $octets[2] ) * 256 + $octets[3];

my $remaining = 32 - $bits;

for my $i (0..(2 ** $remaining) - 1) {
    print_ip( $base + $i );
}

# Print an IP address, given as a 32-bit integer, in dotted-quad form
sub print_ip {
    my ( $address ) = @_;

    my @octets;

    for my $i (0..3) {
        push @octets, ($address % 256);
        $address >>= 8;
    }

    print "$octets[3].$octets[2].$octets[1].$octets[0]\n";
}

For example, if this script is called ip-decode, you run it with an address prefix as its argument and it prints every address in that range, one per line.

This script has undergone no testing at all...


Friday, September 22, 2006

A Test::Class gotcha

I'm working on a project that involves building a prototype application in Perl. I've made extensive use of Perl's OO features and have a collection of classes that implement the mathematical calculations necessary to drive the web site running the application. Naturally, as I've been building the classes I've been building a unit test suite.

Since Test::Class is the closest thing Perl has to junit or cppunit I'm using it to test all the class methods in my Perl classes. Everything was looking good until I told the guy writing the server to integrate with my code. His code died with an error like this:
Can't locate object method "new" via package "Class::A" (perhaps you
forgot to load "Class::A"?) at Class/ line 147.
Taking a quick look inside Class::B revealed that it did try to create a new Class::A object and that, sure enough, there was no use Class::A; anywhere in Class::B. Easy enough bug to fix, but what left me scratching my head was why the unit test suite didn't show this.

For each class I have an equivalent test class (so there's Class::A::Test and Class::B::Test) which are loaded using a .t file which in turn is loaded with prove. The test classes all use Test::Class.

The classes are tested with a Makefile that does the following:
@prove classes.t
And classes.t consists of:
use strict;
use warnings;

use Class::A::Test;
use Class::B::Test;

Since the test suite for Class::A does a use Class::A; and the test suite for Class::B does a use Class::B; and the two test suites are loaded using use in classes.t, both Class::A and Class::B are loaded before running the tests. This means that the fact that use Class::A; was missing from Class::B is masked in the test suite.
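The masking is easy to reproduce with two throwaway modules (the module names and bodies here are illustrative, not my project's classes):

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Class::B calls Class::A->new but forgets 'use Class::A;'.  The bug
# only shows up when Class::B is loaded without Class::A.
my $dir = tempdir( CLEANUP => 1 );
mkdir "$dir/Class" or die $!;

open my $fa, '>', "$dir/Class/" or die $!;
print $fa "package Class::A;\nsub new { bless {}, shift }\n1;\n";
close $fa;

open my $fb, '>', "$dir/Class/" or die $!;    # note: no 'use Class::A;'
print $fb "package Class::B;\nsub make_a { Class::A->new }\n1;\n";
close $fb;

unshift @INC, $dir;
require Class::B;

# With only Class::B loaded the bug shows up...
my $alone = eval { Class::B::make_a(); 1 } ? 'ok' : $@;

# ...but once Class::A is loaded (as a test suite that uses every test
# class would do) the same call succeeds, masking the missing use.
require Class::A;
my $masked = eval { Class::B::make_a(); 1 } ? 'ok' : $@;
```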

The solution is to have two .t files, one for each class, so that only the class being tested is loaded. So I dumped classes.t and created class_a.t and class_b.t as follows:

class_a.t:

use strict;
use warnings;

use Class::A::Test;

class_b.t:

use strict;
use warnings;

use Class::B::Test;

and the Makefile is changed to do:
@prove class_a.t class_b.t
This now works correctly. The missing use Class::A; causes a fatal error in the test suite.


Thursday, April 20, 2006

Stateless web pages with hashes

Recently I've been working on a web application that requires some state to be passed between pages. I really didn't want to keep server side state and then give the user a cookie or some other token that I'd have to track in the server side application, age out if discarded etc.

I hit upon the idea of keeping everything in hidden fields passed between page transitions by form POSTs. Of course, the problem with hidden fields is that someone could fake the information and submit a form with their notion of state. For example, if this were a commerce application, someone could alter the contents of their own shopping cart and perhaps even the prices they have to pay.

To get around this problem I include two extra pieces of information in the form: the Unix epoch time when the form was delivered to the user and a hash that covers all the contents of the form. For example, a typical form might look like:
<form action= method=POST>
<input type=hidden name=hash value=ff9f5c4a0d10d7ab384ad0f95ff3727f>
<input type=hidden name=now value=1145516605>
<input type=hidden name=cart value=agwiji8973cnwiei938943>
<input type=submit value="Checkout" name=checkout>
</form>
Here the form contains a cart value that is just an encoded version of the contents of the user's cart (note I say encoded and not encrypted; there's no protection inherent in the string encoding the cart contents: it's just safe to be passed in a form).

The time the form was sent to the user is in the now field and is just the Unix epoch time when the page was generated on the server side.

The hash is an MD5 hash of the now, the cart, the IP address of the person who requested the page and a salt value known only to the web server. The salt prevents an attacker from generating their own hashes and hence faking the form values, while still letting the web server verify the validity of the form.

The now value means that old forms can be timed out just by checking the epoch time against the value in the form. The hashing of the IP address means that only the person for whom the form was generated can submit it.
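Generating and checking such a hash takes only a few lines with Digest::MD5; the field order, separator and salt below are illustrative, not the scheme my application actually uses:

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my $SALT = 'server-side-secret';    # known only to the web server

# Hash covering the form's now and cart fields plus the requester's IP.
sub form_hash {
    my ( $now, $cart, $ip ) = @_;
    return md5_hex( join( '|', $now, $cart, $ip, $SALT ) );
}

# Check a submitted form: the hash must match what the server would
# have generated, and the form must be no older than $max_age seconds.
sub form_valid {
    my ( $hash, $now, $cart, $ip, $current_time, $max_age ) = @_;
    return 0 if $hash ne form_hash( $now, $cart, $ip );
    return 0 if $current_time - $now > $max_age;
    return 1;
}
```

Tampering with any covered field changes the required hash, and an old now value fails the age check, so the server keeps no per-user state at all.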

I'm sure this isn't new to anyone who's written web applications. And it appears that Steve Gibson over at GRC is doing something similar with his e-commerce system and there's apparently something called View State in ASP.

Anyone who is a web expert care to comment?