My love of applied maths and my good old DC days working for the graphics department of scientific publisher AGU combine to make me quite the data visualization junkie, as many of you have no doubt figured out by now. So I found this post on the Media Matters website to be a fascinating data point (ahem!) on the public’s perception of what makes an honest graph.
At issue here is a graph of unemployment figures from 2011 shown yesterday on Fox News. Notice that the data point for November is plotted incorrectly, giving the appearance of no improvement when in fact there was some.
Media Matters is right, obviously, to point out this discrepancy, although I thought the additional graphic they included just to illustrate this simple point (not pictured here) was a tad unnecessary. What I found really interesting is the alternative they offered, which was put forth as an “honest chart”:
In the strictest sense, all of the data points are plotted where they should be, so everything appears on the up-and-up. But is that sufficient to make a graph honest?
Edward Tufte, a pioneer in the field of data visualization and information design, states the following oft-referenced principle in his book The Visual Display of Quantitative Information:
The representation of numbers, as physically measured on the surface of the graphic itself, should be directly proportional to the quantities represented.
To measure a graph’s performance against this standard, Tufte introduces a quantity he calls the “lie factor,” defined as the size of the effect shown in the graph divided by the size of the effect in the data.
So how does Media Matters’ “honest” graph measure up? I’d argue that the right portion of the chart appears to show the unemployment rate plummeting between October and November, with the line dropping by about two thirds of the chart’s height. But in reality, the relative decrease is only (9.0% - 8.6%)/9.0% = 4.4%. So the lie factor here would be 66%/4.4%, or about 15. In other words, this graph overstates the effect in the data by a whopping 15 times!
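For anyone who wants to check the arithmetic, here is a minimal Python sketch of that calculation. The function name lie_factor is just a label chosen for illustration, and the two-thirds figure is the eyeballed estimate from above, not a measurement taken from the chart.

```python
def lie_factor(graphic_effect, data_effect):
    """Tufte's lie factor: effect shown in the graphic / effect in the data."""
    return graphic_effect / data_effect


# Relative drop in the data from October (9.0%) to November (8.6%).
data_effect = (9.0 - 8.6) / 9.0

# Eyeballed estimate: the drop spans about two thirds of the chart's height.
graphic_effect = 2 / 3

factor = lie_factor(graphic_effect, data_effect)
print(f"effect in the data: {data_effect:.1%}")
print(f"lie factor: {factor:.0f}")
```

A different visual estimate of the drop would scale the result proportionally, so the figure of 15 is only as good as the two-thirds guess.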
This is likely an honest mistake, as Excel, R, and other commonly used plotting programs often default to a y-axis range just larger than the data range, which produces exactly the kind of graph held up as the standard by Media Matters. But even an honest mistake can lead to a dishonest graph. Just for fun, I extended the y-axis to 0 so as to eliminate any lie factor; here’s the resulting graph:
The beginning of what appears to be a downtrend is visible in the latter half of the year here, with a decent but not earth-shattering drop between October and November, and numbers otherwise appearing rather high. This is probably a pretty accurate representation of unemployment numbers, but that’s just a lucky break for the most part. What if I were plotting, say, today’s hourly Dow Jones index values? This has fluctuated between 11,990 and 12,100 so far today, often wildly. But extending my y-axis to 0 would certainly mean my chart would fail to show these effects at all. Limiting it would introduce some degree of misrepresentation. So how should I set my y-axis values to show trends and fluctuations while at the same time not exaggerating any effects?
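To put rough numbers on that trade-off, the sketch below computes how much of the y-axis the day's range would occupy under two axis choices. The helper axis_share is a hypothetical name, and setting the top of the axis at the day's high is an assumption made purely for illustration.

```python
def axis_share(low, high, axis_min, axis_max):
    """Fraction of the y-axis height spanned by the data range [low, high]."""
    return (high - low) / (axis_max - axis_min)


# Today's range for the index, as quoted above.
low, high = 11_990, 12_100

# Axis running from 0 up to the day's high (assumed axis top).
share_from_zero = axis_share(low, high, axis_min=0, axis_max=high)

# Axis fitted tightly to the data range.
share_tight = axis_share(low, high, axis_min=low, axis_max=high)

print(f"axis from 0: {share_from_zero:.1%} of the chart height")
print(f"tight axis:  {share_tight:.0%} of the chart height")
```

With the axis starting at 0, the whole day's movement is squeezed into under 1% of the chart height, which is why the fluctuations would effectively vanish.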
I think this is a difficult question, and one whose answer depends heavily on the situation. For the unemployment numbers, I think adding some context is helpful, so I plotted lines representing the historic (1948-2010) average unemployment rate as well as the all-time high and low values. Despite the fact that I introduced a (small in this case) lie factor by starting my y-axis at 2% rather than 0 in the new graph, I think this paints a pretty accurate picture.
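As a rough check on how small that lie factor is, the sketch below measures the effect in the graph as the change in a point's height above the bottom of the y-axis. That is one reasonable reading of Tufte's definition, assumed here for illustration, and it differs from the fraction-of-chart-height estimate used for the Media Matters chart. The function name is hypothetical.

```python
def lie_factor_for_baseline(start, end, baseline):
    """Lie factor when values are drawn as heights above `baseline`."""
    # Relative change in the data itself.
    data_effect = (start - end) / start
    # Relative change in plotted height, measured from the axis baseline.
    graphic_effect = ((start - baseline) - (end - baseline)) / (start - baseline)
    return graphic_effect / data_effect


# October (9.0%) to November (8.6%), with the y-axis starting at 0 and at 2.
factor_at_zero = lie_factor_for_baseline(9.0, 8.6, baseline=0.0)
factor_at_two = lie_factor_for_baseline(9.0, 8.6, baseline=2.0)

print(f"baseline 0%: {factor_at_zero:.2f}")
print(f"baseline 2%: {factor_at_two:.2f}")
```

Under this assumption the lie factor reduces to start / (start - baseline), so it is exactly 1 with the axis at 0 and about 1.29 with the axis at 2%.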
Data visualization is a subtle art, and one that I fear too few of the people responsible for writing scientific publications, news articles, and the like fully appreciate. As consumers of this information, I’d encourage everyone to look at graphs with a skeptical eye and check twice to be sure that they’re accurately representing the data.