Tuesday, May 15, 2007

Some architectural details of Signal Spam

Finally, Signal Spam, France's new national anti-spam system, has launched and I'm able to talk about it. For a brief introduction in English, start here.

I'm not responsible for the idea behind Signal Spam, nor for its organization, but I did write almost all the code used to run the site and the back end system. This blog post talks a little bit about the design of Signal Spam.

Signal Spam lets people report spam via either a web form or a plug-in. Plug-ins are currently available for Outlook 2003, Outlook 2007 and Thunderbird 2.0, with more to come. Currently Signal Spam does three things with every message: it keeps a copy in a database after extracting information from the body and headers; it figures out whether the message came from an ISP in France and, if so, sends an automatic message to the ISP indicating that they've got a spammer or zombie on their network; and it figures out whether the message was actually a legitimate e-marketing message from a French mailer and, if so, tells the person reporting the spam how to unsubscribe.

The original plan was that the system be capable of handling 1,000,000 messages per day, allowing for peaks of up to 1,000 messages per minute (such as when people first come to work in the morning), and that messages would be handled in near real time (i.e. the time from a message being received by the system to it being analyzed and forwarded to an ISP would be under 60 seconds). Signal Spam also wanted a lot of flexibility: the ability to scale the system up as use of the site grew, and the ability to do maintenance on the site without taking it down.

Here's the last 12 hours of activity on the site, which pretty much matches what we expected, with a peak once people get to work and start reading their mail. (These charts are produced automatically by the lovely RRDTool.)



The system I built is split into two parts: the front end (everything that the general public sees including the API used by the plug-ins) and the back end (the actual database storing the messages sent, the software that does analysis and the administration interface). Communication between the front end and the back end uses a web service running over HTTPS.

To make things scale easily the back end is entirely organized around a simple queuing system. When a message arrives from a user it is immediately stored in the database. (There are, in fact, two tables: one contains the complete raw message as a BLOB and the other contains the fields extracted from the message. A message has the same ID in both tables, and the non-BLOB table is used for searching and sorting.)
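For illustration, here's a minimal sketch of what that pair of tables might look like (the table and column names are mine for this post, not necessarily the real schema):

#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Hypothetical schema: one table holds the complete raw message as a
# BLOB, the other holds the extracted fields used for searching and
# sorting. The two tables share the same message ID.
my $dbh = DBI->connect( 'dbi:mysql:database=signalspam', 'user', 'password',
    { RaiseError => 1 } );

$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS message_info (
    id        INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    sender    VARCHAR(255),
    subject   VARCHAR(255),
    source_ip VARCHAR(15),
    received  DATETIME,
    INDEX (source_ip),
    INDEX (received)
) ENGINE=InnoDB
SQL

$dbh->do(<<'SQL');
CREATE TABLE IF NOT EXISTS message_raw (
    id  INT UNSIGNED NOT NULL PRIMARY KEY,
    raw LONGBLOB     NOT NULL
) ENGINE=InnoDB
SQL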

Once stored in the database, the message ID is added to a FIFO queue (which is actually implemented as a database table). An arbitrary number of processes handle message analysis by dequeuing IDs from the FIFO queue (using row-level locking so that only one process gets each ID). Once dequeued, the message is analyzed: fields such as From, Subject and Date are extracted and stored in the database; the Received headers are walked, using a combination of blacklist lookups and forgery detection, to find the true IP address that injected the message into the Internet; the IP address is mapped to the network that manages it; fingerprints of the message are taken; and all URLs inside the message are extracted.
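As a rough sketch of the dequeue step (the real code lives inside the Queue object described below; the queue table here is assumed to be analysis_queue(id INT AUTO_INCREMENT PRIMARY KEY, message_id INT) on InnoDB):

use strict;
use warnings;
use DBI;

# Minimal locking dequeue: SELECT ... FOR UPDATE inside a transaction
# takes a row-level lock on the oldest row, so two analysis workers
# cannot pull the same message ID off the queue.
sub dequeue_message_id {
    my ($dbh) = @_;

    $dbh->begin_work;

    my ( $row_id, $message_id ) = $dbh->selectrow_array(
        'SELECT id, message_id FROM analysis_queue
          ORDER BY id LIMIT 1 FOR UPDATE' );

    unless ( defined $row_id ) {    # queue is empty
        $dbh->commit;
        return undef;
    }

    $dbh->do( 'DELETE FROM analysis_queue WHERE id = ?', undef, $row_id );
    $dbh->commit;

    return $message_id;
}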

Once the analysis is complete, the process decides whether the message needs to be sent to an ISP. If so, it enqueues the message ID on another FIFO queue for a separate forwarding process to handle. If the message is in fact a legitimate one, the message ID is enqueued on a FIFO queue for a separate response process to handle.

For every ID it dequeues from its queue, the forwarding process generates an ARF message to the appropriate ISP and sends it, using VERP for bounce and response handling.
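The idea behind VERP is simply to encode enough information in the envelope sender that a bounce or reply identifies the exact report it concerns. A tiny sketch (the address format below is purely illustrative, not the one Signal Spam actually uses):

use strict;
use warnings;

# VERP sketch: bake the report ID and recipient into the envelope
# sender so any bounce or reply can be matched to the forwarded
# report. The address format is illustrative only.
sub verp_sender {
    my ( $message_id, $recipient ) = @_;

    ( my $encoded = $recipient ) =~ s/\@/=/;  # abuse@isp.example -> abuse=isp.example
    return "reports-$message_id-$encoded\@example.org";
}

# verp_sender( 12345, 'abuse@isp.example' )
#   returns 'reports-12345-abuse=isp.example@example.org'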

The response process dequeues IDs and responds to the original reporter of the spam with a message tailored to the specific e-marketer, including full unsubscribe details.

The use of queues, a shared database to handle them and a simple locking strategy means that an arbitrary number of processes can be added to handle the load on the system as required (currently there is only one process of each type running and it handles all messages within the required delay). It also means that the processes do not need to be on the same machine and the system can scale by adding processes or adding hardware.

Stopping the processes does not stop the front end from operating. Messages will still be added to the database and the analysis queue will grow. In fact, the queue makes measuring the health of the system trivial: just look at its length to see whether we are keeping up or not.
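For example, a health check can be as small as this (again using the illustrative analysis_queue table name from above):

use strict;
use warnings;
use DBI;

# Print the depth of the analysis queue; a growing number means the
# analysis workers are not keeping up with incoming reports.
my $dbh = DBI->connect( 'dbi:mysql:database=signalspam', 'user', 'password',
    { RaiseError => 1 } );

my ($depth) = $dbh->selectrow_array('SELECT COUNT(*) FROM analysis_queue');
print "analysis queue depth: $depth\n";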

Since the queue holds the knowledge about the work to be done, processes can be stopped and upgraded as needed without taking the system offline.

To hide all this, the entire system (which is written in Perl; in fact, the back end is entirely LAMP) uses an object structure. For example, creating the Message object (passing the raw message into the constructor) performs the initial message analysis and queues the message for further analysis. Access to the database queues is entirely wrapped in a Queue object (whose constructor takes the queue name). These objects are dynamically loaded by Perl and can be upgraded as needed.
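As a sketch of the Message side of that (class, table and column names are the illustrative ones from earlier, the header-parsing module is just an example, and the real constructor does far more analysis):

package Message;

use strict;
use warnings;
use Email::Simple;    # assumed here; any mail-parsing module would do

# Illustrative Message constructor: take the raw message, store it in
# both tables, then queue the new ID for the analysis workers.
sub new {
    my ( $class, $dbh, $raw ) = @_;

    my $email = Email::Simple->new($raw);
    my $self  = bless { dbh => $dbh, raw => $raw }, $class;

    $dbh->begin_work;

    $dbh->do( 'INSERT INTO message_info (sender, subject) VALUES (?, ?)',
        undef, $email->header('From'), $email->header('Subject') );
    $self->{id} = $dbh->last_insert_id( undef, undef, 'message_info', 'id' );

    $dbh->do( 'INSERT INTO message_raw (id, raw) VALUES (?, ?)',
        undef, $self->{id}, $raw );

    # The real code goes through the Queue object; here it is a plain
    # insert into the analysis queue table sketched earlier.
    $dbh->do( 'INSERT INTO analysis_queue (message_id) VALUES (?)',
        undef, $self->{id} );

    $dbh->commit;

    return $self;
}

sub id { return $_[0]->{id} }

1;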

Finally, all the objects (and related scripts) have unit tests using Perl's Test::Class and the entire system can be tested with a quick 'make test'. One complexity is that most of the classes require access to the database. To work around this I have created a Test::Database class that is capable of setting up a complete MySQL system from scratch (assuming MySQL is installed) and loading the right schema, and that is totally independent of any other MySQL instance. The class then returns a DBI handle to that instance plus a connect string. This means the unit tests do not depend on an existing, running database.
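A minimal sketch of what one of those tests might look like (the new/dbh interface on Test::Database is my guess for this post, and the table names are the illustrative ones from above):

package t::Database::Test;

use strict;
use warnings;
use base 'Test::Class';
use Test::More;
use Test::Database;    # the home-grown class described above

# Bring up the private MySQL instance (with the schema loaded) once,
# before any test method runs.
sub startup : Test(startup) {
    my $self = shift;
    $self->{db}  = Test::Database->new;
    $self->{dbh} = $self->{db}->dbh;
}

sub stores_and_reads_a_message : Test(1) {
    my $self = shift;
    my $dbh  = $self->{dbh};

    $dbh->do( 'INSERT INTO message_info (sender, subject) VALUES (?, ?)',
        undef, 'spammer@example.com', 'Cheap watches' );

    my ($subject) = $dbh->selectrow_array(
        'SELECT subject FROM message_info WHERE sender = ?',
        undef, 'spammer@example.com' );

    is( $subject, 'Cheap watches', 'message fields round-trip' );
}

1;

# In a .t file run by 'make test':  t::Database::Test->runtests;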

In addition, the unit tests use a system that I created for POPFile which gives me line-by-line coverage information showing what's tested and what's not. Running 'make coverage' runs the test suite with coverage information enabled. This gives the percentage of lines tested, and for every Perl module, class and script a corresponding HTML file is generated with lines colored green (if they were executed during testing) or red (if not). The coloring is achieved by hooking the Perl debugger (see this module from POPFile for details).
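The debugger hook itself is simple. Here's a from-scratch sketch of the technique (not the POPFile module itself): under perl -d the debugger calls DB::DB before every statement, so recording the file and line there provides all the raw coverage data; mapping it back onto source files and coloring the HTML is the rest of the work.

package Devel::SimpleCover;

use strict;
use warnings;

# Run a script as:  perl -d:SimpleCover yourscript.pl
# Every executed file:line pair is counted and dumped at exit.

my %covered;

package DB;

# Under -d, perl calls DB::DB before each statement is executed.
sub DB {
    my ( undef, $file, $line ) = caller;
    $covered{"$file:$line"}++;
}

END {
    open my $out, '>', 'coverage.txt' or die "coverage.txt: $!";
    print {$out} "$_ $covered{$_}\n" for sort keys %covered;
    close $out;
}

1;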

Here's an example of what that looks like (here I'm showing Classifier/Bayes.html which corresponds to Classifier/Bayes.pm in the POPFile code, but it looks the same in Signal Spam):



The green lines (with line numbers added) were executed by the test suite; the red line was not (you can see here that my test suite for POPFile didn't test the possibility that the database connect would fail).


7 Comments:

Blogger abecc said...

Why did you decide to use a database to implement the queue?

1:55 PM  
Blogger John Graham-Cumming said...

It's currently implemented as a queue in the database because the database allows remote access (so that I can scale across many machines) and has useful properties of robustness. (And we already had a database in the mix).

Since queue access is wrapped by the Queue object, the actual implementation is irrelevant to the code that needs enqueue/dequeue operations and could easily be changed out.

Do you have recommendation for a better way?

John.

1:59 PM  
Blogger Jeff said...

I very much like the producer/queue/consumer concept. However, in this case your database is a central point of failure. To scale, you said you'll just add consumers. I've taken a slightly different approach in my systems: all machines are producers/queues/consumers. This way there is no central point of failure.

2:33 PM  
Blogger John Graham-Cumming said...

Jeff,

I do agree that the database is a single point of failure (which, of course, we have to get around with replication). I'd be interested in knowing more about the architecture of your program.

Always open to learning something new.

John.

4:08 PM  
Blogger codeslinger said...

If Signal Spam gets popular, you will have a problem queuing the raw messages into a database, even MySQL. We eventually had to take the messages out of the database and just lay them on disk with pointers to their location going into the DB instead.

-- Toby DiPasquale

4:11 PM  
Blogger John Graham-Cumming said...

Toby,

Part of the motivation for structuring the database as I did (each message has a row in one table containing the crunched data, e.g. the source IP address and From line, and a totally separate table holds the messages themselves), plus the use of an object to disguise the actual storage, was that I figured the table size might cause us a problem and we'd need to go to disk at some point.

Any idea at what size MySQL gave up for you?

John.

4:16 PM  
Blogger Helix said...

I assume you're using InnoDB tables (considering the row-level locking in your FIFO queue).

I've never used InnoDB transactional tables, but my experience with MyISAM tables is that a table becomes slow to write to because of the growing index that has to be updated on each insert, and this is worse with non-fixed-width tables.

I don't know how InnoDB tables are architected (one file for the data and a separate one for the index?), but because of the FIFO concept you can do without an index entirely.

Just do:
SELECT * FROM my_queue_table LIMIT 1; # id = 42
DELETE FROM my_queue_table WHERE id = 42;

Because there isn't any index, MySQL will perform a full table scan for the DELETE query... but remember, it's a FIFO table, so the scan only has to go as far as the first row, so you can do without the index.

You may also tune the concurrent_insert parameter to 2 in order to have non-blocking INSERTs.

10:10 AM  
