Thursday, December 17, 2009

Data Visualization Disease

A few days ago I moaned about an inaccurate and uninterpretable visualization appearing in a book touting its own excellence at visualization. Now, I'm pointed to a visualization of the recently released Met Office land surface temperature record that makes similar mistakes.

Folks, data visualization isn't about pretty colours, or slapping some data into a CSV and asking Excel to make you a line graph. It's about thinking about how the data needs to be interpreted and then creating an appropriate visualization. Many of the 'infoporn' graphics that adorn the blogs and magazines of the digerati (a pejorative term) are little more than the fantasies of a graphic designer sprinkled with some magical 'data' or 'statistics' pixie dust.

But these designers shouldn't be messing around with magic like that. They aren't trained to handle it, Hermione.

Here's the first graph from the blog. It appears to show that it's 10°C hotter now than in the 1800s. Holy cow, Batman, the Earth's on fire!

It's all wrong.

All they've done is average the temperature readings from across the globe to try to get a sense of global warming. Averages are fun because any fool can calculate them, but pity the fool who averages without thinking. Some questions:

1. Did they ask themselves about the distribution of temperature readings across the globe to ensure that the average correctly reflected the entire Earth's surface? For example, are there lots of thermometers clustered close to each other that might bias the average?

2. Did they ponder the fact that there's much more land in the northern hemisphere, hence many more readings, hence, without weighting, the average is dominated by northern climes?

3. Did they ask themselves if an average is what you want? Is it reasonable to take the temperature in London in December and the temperature in Sydney in December and average them? Given that it's winter up north and summer down south, what does an average tell you?

4. Did they ever ask themselves why the standard deviation is so freakin' huge (see the 2008 numbers in the graph above)?

No, they made a CSV file and graphed it. And since they get some 'warming' out of it they are happy.
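The clustering problem behind the first two questions is easy to see with a toy example. The sketch below uses made-up readings and assumes, purely for illustration, two regions of equal area: one with four thermometers huddled together and one with a single thermometer. The names `readings_by_region`, `naive` and `area_aware` are mine, not anything from the Met Office data.

```python
from statistics import mean

# Made-up readings in degrees C, purely for illustration.
# Assumption: the two regions cover the same area of the Earth's surface.
readings_by_region = {
    "cold_region": [5.0, 6.0, 7.0, 6.0],  # four thermometers clustered together
    "warm_region": [26.0],                # a single thermometer
}

# Naive approach: throw every reading into one big average.
all_readings = [t for temps in readings_by_region.values() for t in temps]
naive = mean(all_readings)

# Area-aware approach: average within each region first, then across regions.
area_aware = mean(mean(temps) for temps in readings_by_region.values())

print(f"naive: {naive:.1f} C, region-by-region: {area_aware:.1f} C")
```

With these numbers the naive average is 10.0 and the region-by-region average is 16.0. Nothing about the climate differs between the two calculations; the only thing that changed is how much say the crowded region gets.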

This is what I call Data Visualization Disease. You grab some data, you think of a fancy (or not so fancy) way to show it. You shade it in pastel colours you picked by wandering around Habitat, label it in a sans-serif font, and you're a God of visualization.

What they should have done is this: take the thermometer readings and calculate a long-term average for each location. Calculate the difference between each reading and that average (to understand how much temperature has changed, not the absolute values). Map those differences onto a grid laid across the Earth's surface, and average (perhaps with variance adjustment) the values from all the thermometers in each grid square to get a grid anomaly value. Finally, produce a weighted average for each hemisphere, weighting each grid box by the cosine of its latitude (since the grid box area varies with latitude).

Then they could have plotted that.
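For the curious, here is a rough sketch of that recipe in Python. It is not the Met Office's code, and it makes several simplifying assumptions of my own: each station is a hypothetical `(lat, lon, {year: temperature})` tuple holding annual means, the baseline is whatever years you pass in, the grid boxes are 5 degrees on a side, and the variance adjustment is skipped entirely.

```python
import math
from collections import defaultdict
from statistics import mean


def station_anomalies(readings, baseline_years):
    """Return {year: reading minus the station's own baseline average}.

    Assumes the station has at least one reading in the baseline years.
    """
    baseline = mean(readings[y] for y in baseline_years if y in readings)
    return {year: temp - baseline for year, temp in readings.items()}


def grid_cell(lat, lon, size=5.0):
    """Return the (row, column) index of the grid box containing a point."""
    return (math.floor(lat / size), math.floor(lon / size))


def hemisphere_anomaly(stations, year, baseline_years, size=5.0, north=True):
    """Cosine-of-latitude weighted mean of grid-box anomalies for one hemisphere."""
    cells = defaultdict(list)
    for lat, lon, readings in stations:
        # Skip stations in the other hemisphere or with no reading this year.
        if (lat >= 0) != north or year not in readings:
            continue
        anomalies = station_anomalies(readings, baseline_years)
        cells[grid_cell(lat, lon, size)].append(anomalies[year])

    if not cells:
        raise ValueError("no stations with data for this hemisphere and year")

    total = 0.0
    weight_sum = 0.0
    for (row, _col), values in cells.items():
        centre_lat = (row + 0.5) * size
        # Box area shrinks towards the poles, roughly as cos(latitude).
        weight = math.cos(math.radians(centre_lat))
        total += weight * mean(values)  # plain mean inside the box
        weight_sum += weight
    return total / weight_sum


# Made-up stations, purely to show the call.
demo_stations = [
    (52.0, 0.0, {1961: 10.0, 1962: 10.0, 2008: 11.0}),
    (53.0, 1.0, {1961: 9.0, 1962: 9.0, 2008: 10.0}),
    (2.0, 100.0, {1961: 27.0, 1962: 27.0, 2008: 27.0}),
]
print(hemisphere_anomaly(demo_stations, 2008, [1961, 1962]))
```

Two things are worth noticing. The two nearby stations share a grid box, so together they count once rather than twice. And because each station is compared only with its own history, a 27-degree station that hasn't changed contributes zero, not 27. A real treatment would also use a separate baseline for each month to deal with the seasons, which this sketch sidesteps by working with annual values.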

But there's no infoporn in doing that, that sounds like actual work, and worse, thinking. Phew! No, thanks. Pass the Crayola.

Update: since writing this rant I've seen that the blog I'm criticizing has listened to the complaints of people who pointed out similar problems.

Robert Kosara said...

You're right, I didn't think about the data a whole lot when I made this. Good call. I've made two other postings in the meantime that are much better (you're linking to the latest one, but there's also this one).

The latest one shows individual stations, so there is no more averaging. Once I've grokked that whole baseline thing better, I'll do one that aggregates the differences into a more cohesive chart.

2:14 PM

junklight said...

Presumably the data set from the 1800s (when travel was harder, records were on paper, and proper recording tech was expensive) also contains a less diverse set of readings than modern ones (which are digital, and the technology for reading temperatures is now pretty much disposable).

2:31 PM

Gary P said...

EM Smith has published a note on how the mean latitude of the thermometers has moved toward the tropics.

This explains the increase in the average.

9:51 PM  

