Thursday, March 11, 2010

My bio

Occasionally I get asked for some sort of official bio. Here's one people can use:

John Graham-Cumming is a computer programmer and author. He studied mathematics and computation at Oxford and stayed for a doctorate in computer security. As a programmer he has worked in Silicon Valley and New York, and in the UK and France. His open source POPFile program won a Jolt Productivity Award in 2004.

He is the author of a travel book for scientists called The Geek Atlas and has written articles for The Times, The Guardian, The Sunday Times, The San Francisco Chronicle and New Scientist.

He is CTO of Causata. He can be found on the web at jgc.org and on Twitter as @jgrahamc.


If you've heard of him at all, it's likely because in 2009 he successfully petitioned the British Government to apologize for the mistreatment of British mathematician Alan Turing.


Friday, February 12, 2010

So you think machine learning is boring...

Here's something I wrote for the company blog.

If you say the words 'machine learning' to people they either look confused or bored. Since the promise of Artificial Intelligence evaporated in the 1970s, machine intelligence seems to be one of those things that's a perpetual 20 years away.

But computer industry insiders know that many forms of machine learning are at work all the time. The most common and visible are recommendation systems like the one on Amazon.com that comes up with suggestions for other books you might like. But even that doesn't express the true power of state-of-the-art algorithms.

But a helicopter doing backflips and a walking robot do.

Read the rest.


Thursday, October 22, 2009

Some real data about JavaScript tagging on web pages

Since March of this year I've been running a private web spider looking at the number of web tags on web pages belonging to the Fortune 1000 and the top 1,000 web sites by traffic. Using the spider I've been able to see which products are deployed where, and how those products are growing or shrinking.

The web tags being tracked are those used for ad serving, web analytics, A/B testing, audience measurement and similar.
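To give a flavour of how that detection can work (this isn't the spider's actual code), here's a minimal sketch that scans a captured page's HTML for script sources and matches them against a few well-known vendor URL patterns; the patterns shown are illustrative rather than exhaustive.

    // Map <script src> URLs found in a captured page to tag vendors using known
    // URL substrings. The patterns below are illustrative examples only.
    var vendorPatterns = {
      'Google Analytics': /google-analytics\.com/i,
      'Omniture':         /2o7\.net|omtrdc\.net/i,
      'Quantcast':        /quantserve\.com/i,
      'DoubleClick':      /doubleclick\.net/i,
      'comScore Beacon':  /scorecardresearch\.com/i
    };

    function detectTags(html) {
      var found = {};
      var scriptSrc = /<script[^>]+src=["']([^"']+)["']/gi;
      var match;
      while ((match = scriptSrc.exec(html)) !== null) {
        for (var vendor in vendorPatterns) {
          if (vendorPatterns[vendor].test(match[1])) {
            found[vendor] = true;
          }
        }
      }
      return Object.keys(found);   // distinct products seen on this page
    }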

The spider captures everything about the page, including screen shots, and I'm able to drill in to see the state of a page and all its includes at the time of spidering. Here's a shot of Apple with all the detail that the spider keeps.



The first interesting thing is to look at the top 1,000 web sites by traffic and see how many different tags are deployed per page. The average is 2.21, but if you exclude those that have no tags at all then the average is 3.10. Here's the distribution of number of tags against percentage of sites.
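For the record, those two averages are computed in the obvious way; here's a minimal sketch, assuming an array of per-site tag counts like the ones the spider produces:

    function averageTags(tagCounts) {
      var total = 0, tagged = 0;
      for (var i = 0; i < tagCounts.length; i++) {
        total += tagCounts[i];
        if (tagCounts[i] > 0) tagged++;
      }
      return {
        allSites: total / tagCounts.length,   // ~2.21 over all 1,000 sites
        taggedOnly: total / tagged            // ~3.10 over sites with at least one tag
      };
    }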


And of course, it's possible to see the market share of various different products. Here are the top 10 that I am tracking. Google Analytics has an impressive 43% of the top 1,000 web sites by traffic.


Since I've been tracking over time it's also possible to watch the growth (and decline). Here's the growth in the average number of tags on a web page (excluding pages that have no tags) since March 2009.

Since I also keep all the JavaScript and HTML for a page it's a breeze to calculate page weights. Here's a chart showing the size of HTML and JavaScript for the top 1,000 web pages by traffic. The x-axis shows the size of the page (excluding images) in kilo- or megabytes. The y-axis is the percentage of sites in that band.
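As a sketch of that calculation (the file layout below is hypothetical, not how my spider actually stores pages): the non-image weight of a site is just the sum of the sizes of its stored HTML and JavaScript, which can then be dropped into size bands for the distribution.

    var fs = require('fs');
    var path = require('path');

    // Non-image page weight: the stored HTML plus every stored JavaScript include.
    function pageWeightBytes(siteDir) {
      return fs.readdirSync(siteDir)
        .filter(function (f) { return /\.(html|js)$/i.test(f); })
        .reduce(function (sum, f) {
          return sum + fs.statSync(path.join(siteDir, f)).size;
        }, 0);
    }

    // Bucket a weight into a band for the chart (the bands here are illustrative).
    function band(bytes) {
      if (bytes < 100 * 1024)  return 'under 100KB';
      if (bytes < 500 * 1024)  return '100KB to 500KB';
      if (bytes < 1024 * 1024) return '500KB to 1MB';
      return 'over 1MB';
    }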


I was shocked when I saw that chart and suspected a bug. How could there be web sites with megabytes of non-image content? It turned out that it wasn't a bug. For example, at the time of downloading, the HTML and JavaScript for Gawker came to over 1MB.

In a previous post I showed in detail the tagging on a site and that 29% of the non-graphic content was JavaScript used for web tagging. Here's another chart showing what percentage of web page markup is included JavaScript (this can include stuff like jQuery and web tagging products).


The really surprising thing there is how much JavaScript there is on pages. For many pages it's the majority of non-graphic content. Take for example Subscene where the home page HTML is about 18k but then masses of JavaScript are included (including over 200k from Facebook, a similar amount from UPS and various other bits of code).

If you delve into the tags actually used by various products you'll see that the size of the JavaScript used for them varies a lot. comScore's Beacon is tiny (just 866 bytes)!



Finally, you might be asking yourself which site had 16 different tags on it. The winner is the celebrity gossip site TMZ.


Wednesday, October 14, 2009

What is jsHub?

Some time ago I blogged about a new open-source project I'm involved in called jsHub. Since then there's been a little bit of confusion about what jsHub is all about.

Hopefully, I can clear this up in this blog post with an example.



The Problem

The home page of World Wrestling Entertainment has a total of 11 pieces of JavaScript for tracking and ad-serving. If you have Ghostery installed in your browser it will tell you that that page contains the following:



Using our own internal tool we see that page contains DoubleClick, Google Analytics (which they include three times), LeadBack, Microsoft Atlas, Omniture, OpenX, Quantcast, Quigo AdSonar, Revenue Science, Tacoda and comScore Beacon.

The problems with having so many different pieces of tracking JavaScript are many:

1. They add to the page weight. In the case of WWE the HTML of the page is 54687 bytes (the total non-graphic content downloaded is 433211 bytes).

The JavaScript for tracking and ad-serving comes to a total of 125454 bytes, i.e. 29% of the non-graphic content of that page is JavaScript code used to track usage and serve ads.

2. They create a risk of data integrity problems.

A typical problem occurs when one piece of JavaScript works and sends tracking information back and another doesn't. This creates a discrepancy between products that is a problem when trying to reconcile page counts between, say, an analytics product and an advertising system.

This is not a theoretical problem. It's easy for it to occur because a user may hit stop while the page is loading, so a piece of JavaScript near the top of the page may have executed while a piece near the bottom has not. (There's a short sketch after point 7 below that illustrates the effect.)

Indeed, the page code for the WWE site contains the comment <!-- Add Google anlytics after omniture -->, indicating how important the placement of JavaScript code is.

3. They add unnecessary processing time.

Just take a look at this shot of downloading all the JavaScript for the WWE page. This was taken using Firebug and shows how wasteful all that extra code is in terms of download time and execution time.



4. They are next to impossible to check for security problems.

The only option the web master of WWE.com has is to run all the JavaScript he receives from third parties through something like Google Caja to ensure that it's safe, or to insist that it is ADsafe.

Here, for example, is a section of code from Omniture's tracker used on the WWE page:



But the web master actually doesn't have that luxury because typically the JavaScript is being loaded remotely from the analytics vendor's or ad-server's web site and the end web master has no control at all over what's being loaded. Just look at what happened to the New York Times when a malicious ad turned into malware.

In the case of WWE there are 24 includes of JavaScript code from web sites that WWE do not control. And because of the browser security model all these pieces of code are getting equal access to the page.

5. End-users have no way to understand what they are doing.

Although programs like Ghostery are excellent, they can't tell you what's actually happening inside that JavaScript. For example, there's no easy way for an end-user to determine what information is being gathered, or where it's being sent.

There is a tool called WASP but it's aimed at people debugging web site tagging problems, not at the privacy-aware consumer. Here's what WASP says about Tacoda on the WWE web site:



6. They represent duplicated effort as vendors are forced to write and maintain their own JavaScript code.

For example, all those tags have to find a way to send data back to their respective servers meaning there's duplicated code that has to be tested on a wide range of browsers to ensure that it all works.

7. Their inner workings are often obscure.

See above!
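To make point 2 concrete, here's a tiny simulation (the names and numbers are invented purely for illustration) of a page with one tag near the top and another near the bottom, where a fraction of visitors hit stop before the bottom tag gets to run:

    // 'abandonRate' is the fraction of visitors who hit stop before the page finishes.
    function simulateLoads(nLoads, abandonRate) {
      var counts = { topTag: 0, bottomTag: 0 };
      for (var i = 0; i < nLoads; i++) {
        counts.topTag++;                        // the tag at the top always gets to run
        if (Math.random() >= abandonRate) {
          counts.bottomTag++;                   // the bottom tag only runs on complete loads
        }
      }
      return counts;
    }

    console.log(simulateLoads(100000, 0.05));   // e.g. { topTag: 100000, bottomTag: ~95000 }

Even a small abandonment rate leaves a persistent gap between the two products' page counts, and that's exactly the class of problem a single shared tag removes.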

Enter jsHub

jsHub is designed to eliminate these problems. It's a single piece of JavaScript (a "tag") that can handle reading different sorts of page information and then sending it to many different vendors' products. One piece of code to send to Google Analytics, Omniture SiteCatalyst, WebTrends and Mixpanel.

Instead of one piece of JavaScript per vendor, jsHub has a single piece of code (the "hub") and plugins that know how to translate into the required wire protocol for each vendor. Vendors only maintain the plugin for their product.

With one piece of code the page weight is lower, there's no danger of one product getting a page view and another not, and processing time is reduced.
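To make the architecture concrete, here's a minimal sketch of the hub-and-plugin idea. This is not the actual jsHub API (the names and the image-beacon wire format are hypothetical); it just shows how the page data can be gathered once while each vendor plugin worries only about its own protocol.

    var hub = {
      plugins: [],
      register: function (plugin) { this.plugins.push(plugin); },

      // Gather the page data once, then hand the same object to every registered plugin.
      firePageView: function (pageData) {
        for (var i = 0; i < this.plugins.length; i++) {
          this.plugins[i].send(pageData);
        }
      }
    };

    // A vendor plugin only has to know its own wire protocol, e.g. a simple image beacon.
    hub.register({
      name: 'example-analytics',
      send: function (data) {
        var beacon = new Image();
        beacon.src = 'http://stats.example.com/beacon?page=' +
                     encodeURIComponent(data.pageName);
      }
    });

    hub.firePageView({ pageName: document.title });

Because every plugin receives the same data at the same moment, the placement and discrepancy problems described in point 2 above simply don't arise.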

Then to make the entire thing debuggable and easy for an end-user to understand, there's the tag inspector. It's a user interface that talks to the jsHub tag and interrogates its operation. That way a user can see what's being gathered on a page, and who is receiving it.



Since the entire project is open-source it's possible to inspect the code to ensure that it is well written and secure. And it's licensed under a BSD license so that it's open and includable everywhere.

To further ensure that the code is of high quality (and can handle all the different types of browsers that it might be executed in), there's a complete test suite and cross-browser testing system.

To make it clear exactly what data is being gathered, we are also proposing (public domain) standards for marking up page metadata using microformats. Our proposed standard is called hPage.
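To give a feel for the general idea (the class names below are hypothetical placeholders, not the actual hPage vocabulary), the metadata is declared once in the markup and a tag simply reads it from the DOM:

    // The metadata is written once in the page markup, for example:
    //
    //   <div class="pagemeta">
    //     <span class="page-name">Homepage</span>
    //     <span class="page-category">News</span>
    //   </div>
    //
    // and a tag reads it from the DOM rather than from vendor-specific JavaScript variables.
    function readPageMeta(doc) {
      var meta = {};
      var container = doc.querySelector('.pagemeta');
      if (!container) return meta;
      var fields = ['page-name', 'page-category'];
      for (var i = 0; i < fields.length; i++) {
        var el = container.querySelector('.' + fields[i]);
        if (el) meta[fields[i]] = el.textContent || el.innerText;
      }
      return meta;
    }

    // e.g. readPageMeta(document) -> { 'page-name': 'Homepage', 'page-category': 'News' }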

We're just getting started with jsHub. It's running on a small number of sites and we're working to build vendor interest. We strongly believe that a shared, open-source tag is the best solution for the entire web world.

If you want to get involved, contact the team.
