## When a two-by-two ruins a scatterplot

The BBC has some very good analysis of the Brexit vote (how long back was that?), using voting data at the local authority level, and correlating it with factors such as ethnicity and educational attainment.

In terms of educational attainment, there is a really nice chart that plots the proportion of voters who voted to leave against the proportion of the population in the ward with at least a bachelor’s degree. One look at the graph tells you that the correlation is rather strong:

Source: http://www.bbc.com/news/uk-politics-38762034

And then there is the two-by-two that is superimposed on this, with regions being marked off in pink and grey. The idea of the two-by-two must have been to illustrate the correlation, to show that education is negatively correlated with the “leave” vote.

But what do we see here? A majority of the points lie in the bottom left pink region, suggesting that wards with a lower proportion of graduates were less likely to vote to leave. And this is entirely the wrong message for the graph to send.

The two-by-two would have been useful had the points in the graph been neatly divided into clusters that could be arranged in a grid. Here, though, what the scatter plot shows is a nice, negatively correlated linear relationship. And by adding those pink and grey boxes, the illustration takes attention away from that relationship.

Instead, I’d simply put the scatter plot as it is, and maybe add the line of best fit, to emphasise the negative correlation. If I want to be extra geeky, I might also write down the $R^2$ next to the line, to show the extent of correlation!
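For what it’s worth, here is a minimal sketch of what I mean, in Python. The function names (`fit_line`, `plot_with_fit`) are my own, the plotting part assumes you have matplotlib installed, and the numbers at the bottom are made up purely for illustration; they are not the BBC’s data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


def plot_with_fit(xs, ys, xlabel, ylabel):
    """Plain scatter plot with the fitted line and R^2 written on it."""
    import matplotlib.pyplot as plt  # imported here so fit_line works without it

    slope, intercept, r_squared = fit_line(xs, ys)
    fig, ax = plt.subplots()
    ax.scatter(xs, ys, alpha=0.6)
    x_ends = [min(xs), max(xs)]
    ax.plot(x_ends, [intercept + slope * x for x in x_ends], color="black")
    ax.annotate(f"$R^2$ = {r_squared:.2f}", xy=(0.05, 0.05),
                xycoords="axes fraction")
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    return fig


# Made-up illustrative numbers, NOT the BBC's data.
graduates_pct = [12, 18, 22, 27, 31, 36, 42, 50]
leave_pct = [71, 66, 62, 58, 55, 49, 44, 35]
slope, intercept, r_squared = fit_line(graduates_pct, leave_pct)
```

Calling `plot_with_fit(graduates_pct, leave_pct, "% with a degree", "% voting leave")` draws the chart: no boxes, just the points, the line and the number.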

## Anscombe’s Quartet and Analytics

Many “analytics professionals” or “quants” I know or have worked with have no hesitation in diving straight into a statistical model when they are faced with a problem, rather than trying to understand the data. However, that is not the way I work. Whenever I set out to solve a new problem, I start by spending considerable time trying to get a feel for the data. There are many things I do to “feel” the data: look at a few lines of data, look at descriptive statistics of some of the variables, and look at the distributions of individual variables. The most powerful tool that lets me get a feel for data, however, is the humble scatterplot.

The beauty of the scatter plot is that it allows you to get a real feel for the data. Taking variables two at a time, it not only shows you how each of them is distributed but also how they are related to each other. Relationships that are not apparent when you look at the data become apparent when you graph them. I may not be wrong in saying that the scatterplot defines the direction and scope of your entire solution.

The problem with the debate on how analytics needs to be done is that it is loaded. A large majority of people who use statistics in their daily work dive straight into analysis without looking at the data. Perhaps they deem that looking at data is a waste of time? I have even seen pitch decks by extremely reputed software companies that propose solutions such as “we will solve this problem using Logistic Regression” without even having seen the data.

Let us take an example now. Take the following four data sets (my apologies for putting an image here):

Let us say you dive straight into the analysis. Like a good “analytics professional” you dive straight into regression. You may even do some descriptive statistics for each of the data sets along the way. And this is what you find (again, apologies for the image)

Do you conclude that the four data sets are the same? Pretty much identical statistics, right? I wouldn’t be surprised if you were to publish that there is nothing to differentiate between these four data sets. Now, let us do a simple scatter plot of each of these data sets and check for ourselves:

Now, do you still think these data sets are identical? And do you see why I stress so much on getting a feel for the data and drawing the humble scatter plot?

The data set I’ve used here is a rather famous one, and it is called Anscombe’s Quartet. The purpose of the data set is to illustrate precisely the point of this post: that one needs to get a feel for the data before diving into the analysis. Draw scatter plots for every pair of variables. Understand the relationships, and let this understanding guide your further analysis. If one were able to perfectly analyze every piece of data by diving straight into a regression, the job of analytics might as well be outsourced to computers.
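If you would like to check this for yourself, here is a small sketch in plain Python. The eleven points per data set are the standard published values of the quartet; the `summarise` helper and the dictionary layout are just my own choices. For every one of the four sets it should report a mean of x of 9, a mean of y of about 7.5, a correlation of about 0.82, and a fitted line of roughly y = 3 + 0.5x.

```python
import math
from statistics import mean, variance

# Anscombe's quartet: sets I-III share the same x values.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED,
          [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED,
           [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED,
            [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(xs, ys):
    """Descriptive statistics plus a least-squares line for one data set."""
    n = len(xs)
    mean_x, mean_y = mean(xs), mean(ys)
    var_x, var_y = variance(xs), variance(ys)  # sample variances (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    slope = cov / var_x
    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "var_x": var_x,
        "var_y": var_y,
        "corr": cov / math.sqrt(var_x * var_y),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


SUMMARY = {name: summarise(xs, ys) for name, (xs, ys) in QUARTET.items()}

for name, s in SUMMARY.items():
    print(f"{name:>3}: mean_x={s['mean_x']:.2f} mean_y={s['mean_y']:.2f} "
          f"corr={s['corr']:.3f} fit: y = {s['intercept']:.2f} + {s['slope']:.3f}x")
```

The numbers come out practically the same on every row, which is exactly the trap. It is only when you plot each `(xs, ys)` pair that the four sets reveal themselves to be completely different.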

PS: it is a tragedy that when they teach visualization in school they don’t even mention the scatter plot. At a recent workshop I asked the participants to name the different kinds of graphs they knew. “Line”, “Bar” and “Pie” were the most common answers. Not one answered “scatter plot”. Given the utility of this simple plot, this is indeed tragic.