Bad Data Analysis

This is a post tangentially related to work, so I must point out that all views here are my own, and not those of my employer or anyone else I'm associated with.

The good thing about data analysis is that it's inherently easy to do. The bad thing about data analysis is also that it's inherently easy to do – with increasing data democratisation in companies, it is easier than ever to pull some data related to your hypothesis, build a few pivot tables and charts in Excel, and then present your results.

Why is this a bad thing, you may ask – the reason is that it is rather easy to do bad data analysis. I never tire of telling people who ask me "what does the data say?": "what do you want it to say? I can make it say that". This is not a rhetorical statement. As the old saying goes, you can "take data down into the basement and torture it until it confesses to your hypothesis".

So, for example, when I hire analysts, I don't check as much for the ability to pull and analyse data (those skills can be taught) as I do for their logical thinking skills. When they do a piece of data analysis, are they able to say whether it makes sense or not? Can they identify that some correlations the data shows are spurious? Are they taking ratios along the correct axis (e.g. "2% of Indians are below the poverty line" versus "20% of the world's poor are in India")? Are they controlling for confounding variables?
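To make the ratio-axis point concrete, here is a toy calculation – the numbers are invented purely for illustration, not actual poverty statistics:

```python
# Toy illustration of taking ratios along different axes of the same data.
# All numbers here are invented for illustration only.
india_population = 1.4e9   # approximate population of India
india_poor = 140e6         # hypothetical count of poor people in India
world_poor = 700e6         # hypothetical count of poor people worldwide

# Ratio along the "country" axis: what share of Indians are poor?
print(india_poor / india_population)   # 0.10 -> "10% of Indians are poor"

# Ratio along the "poverty" axis: what share of the world's poor are in India?
print(india_poor / world_poor)         # 0.20 -> "20% of the world's poor are in India"
```

The numerator is the same in both cases; picking the wrong denominator changes the story completely.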

This is the real skill in analytics – are you able to draw logical and sensible conclusions from what the data says? It is no coincidence that half my team at my current job has been formally trained in economics.

One of the externalities of being a head of analytics is that you come across a lot of bad data analysis – you are yourself responsible for some of it, your team is responsible for some more, and given the ease of analysing data, there is a lot from everyone else as well.

And it becomes part of your job to comment on this analysis, to draw sense from it, and to say whether it makes sense or not. In most cases, the analysis itself will be immaculate – well-written queries and code. The problem, almost all the time, is in the logic used.

I was reading this post by Nabeel Qureshi on puzzles. In it, he quotes a book on chess puzzles, and talks about the differences between how experts and novices approach a problem.

The lesson I found the most striking is this: there’s a direct correlation between how skilled you are as a chess player, and how much time you spend falsifying your ideas. The authors find that grandmasters spend longer falsifying their idea for a move than they do coming up with the move in the first place, whereas amateur players tend to identify a solution and then play it shortly after without trying their hardest to falsify it first. (Often, amateurs find reasons for playing the move — ‘hope chess’.)

Call this the ‘falsification ratio’: the ratio of time you spend trying to falsify your idea to the time you took coming up with it in the first place. For grandmasters, this is 4:1 — they’ll spend 1 minute finding the right move, and another 4 minutes trying to falsify it, whereas for amateurs this is something like 0.5:1 — 1 minute finding the move, 30 seconds making a cursory effort to falsify it.

It is the same in data analysis. If I think about the amount of time I spend analysing data, a very, very large percentage of it (I can't put a number on it since I don't track my time) goes into "falsifying it". "Does this correlation make sense?" "Have I taken care of all the confounding variables?" "Does the result hold if I take a different sample or cut of the data?" "Has the data I'm using been collected properly?" "Are there any biases in the data that might be affecting the result?" And so on.
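As a minimal sketch of one such check – does the result hold on a different cut of the data? – something like this, where the file name and the column names are placeholders:

```python
import pandas as pd

# Hypothetical dataframe with two metrics whose correlation we think we've found;
# the file name and the columns 'x' and 'y' are placeholders.
df = pd.read_csv("metrics.csv")

correlations = []
for seed in range(100):
    # Re-compute the correlation on a random 70% cut of the data.
    sample = df.sample(frac=0.7, random_state=seed)
    correlations.append(sample["x"].corr(sample["y"]))

# If the correlation flips sign or swings wildly across cuts,
# it probably isn't a robust finding worth presenting.
print(pd.Series(correlations).describe())
```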

It is not an easy job. One small adjustment here or there, and the entire recommendation might flip. Despite being rigorous with the whole process, you can still leave in some inaccuracy. And sometimes what your data shows may not conform to the biases of the counterparty (who has much better domain knowledge) – and so you have a much harder job selling it.

And once again – when someone says "we have used data, so we have been rigorous about the process", it is more likely that they are wrong.

The Comeback of Lakshmi

A few months back I stumbled upon this dataset of all voters registered in Bangalore. One quick scraping script and a run later, I had the names, addresses and voter IDs of all voters registered to vote in Bangalore in the state assembly elections held this year.
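I won't reproduce the actual script, but the shape of it was roughly this – note that the URL, page structure and selectors below are placeholders, not the real electoral roll site:

```python
import csv
import requests
from bs4 import BeautifulSoup

# Placeholder URL and HTML structure -- the real electoral roll pages are different.
BASE_URL = "https://example.com/electoral-rolls/bangalore?page={}"

with open("bangalore_voters.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "address", "voter_id"])
    for page in range(1, 1001):  # assumed number of pages
        html = requests.get(BASE_URL.format(page)).text
        soup = BeautifulSoup(html, "html.parser")
        for row in soup.select("tr.voter-row"):  # hypothetical row selector
            cells = [td.get_text(strip=True) for td in row.select("td")]
            writer.writerow(cells[:3])
```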

As you can imagine, this is a fantastic dataset on which we can do the proverbial "gymnastics". To start with, I'm using it to analyse names in the city, something like what Hariba did with Delhi names. I'll start by looking at the most common names, overall and by age.

Now, extracting first names from a dataset of mostly South Indian names is tricky, since South Indians are quite likely to use initials, and to place them before their given names (for example, when in India, I most commonly write my name as "S Karthik"). I decided to treat all words of length 1 or 2 as initials (thus missing out on the "Om"s), and to assume that the first word in the name of length 3 or greater is the given name (again ignoring those who put their family names first, or those who have expanded initials in the voter set).
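In code, that rule looks roughly like this (a sketch; the dataframe and column names are assumptions based on how I've described the data):

```python
import pandas as pd

def extract_given_name(full_name):
    """Return the first word of length 3 or more; shorter words are treated as initials."""
    for word in str(full_name).split():
        if len(word) >= 3:
            return word.title()
    return None  # the name is made up entirely of initials / short words

# Toy rows; in the actual analysis this is the scraped voter dataframe.
voters = pd.DataFrame({"name": ["S Karthik", "B M Manjunatha", "Lakshmi K"]})
voters["given_name"] = voters["name"].apply(extract_given_name)
print(voters)
```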

The most common male first name in Bangalore, not surprisingly, is Mohammed, borne by 1.5% of all male registered voters in the city. This is followed by Syed, Venkatesh, Ramesh and Suresh. You might be surprised that Manjunath doesn’t make the list. This is a quirk of the way I’ve analysed the data – I’ve taken spellings as given and not tried to group names by alternate spellings.

And as it happens, Manjunatha is in sixth place, while Manjunath is in eighth, and if we were to consider the two as the same name, they would comfortably outnumber the Mohammeds! So the "Uber driver Manjunath(a)" stereotype is fairly well-founded.

Coming to the women, the most common name is Lakshmi, with about 1.55% of all women registered to vote having that name. Lakshmi is closely followed by Manjula (1.5%), with Geetha, Lakshmamma and Jayamma coming some way behind (all less than 1%) but taking the next three spots.
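The counting itself is straightforward – something along these lines, with the alternate-spelling grouping done explicitly where needed (the full voters dataframe with 'given_name' and 'gender' columns is an assumption carried over from the sketch above):

```python
# Share of each given name among male registered voters, as a percentage.
male_share = (
    voters.loc[voters["gender"] == "M", "given_name"]
    .value_counts(normalize=True)
    .mul(100)
    .round(2)
)
print(male_share.head(10))

# Grouping alternate spellings changes the ranking:
# Manjunath + Manjunatha together outnumber the Mohammeds.
combined = voters["given_name"].replace({"Manjunatha": "Manjunath"})
print(combined.value_counts(normalize=True).mul(100).head(10))
```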

Where it gets interesting is when we look at the most common first name by age – see the tables below.

[Tables: most common first names by age, for men and for women]

Among men, it’s interesting to note that among the younger age group (18-39, with exception of 35) and older age group (57+), Muslim names are the most common, while the intermediate range of 40-56 seeing Hindu names such as Venkatesh and Ramesh dominating (if we assume Manjunath and Manjunatha are the same, the combined name comes top in the entire 26-42 age group).

I find the pattern of the most common women's names even more interesting. The -amma suffix seems to have been done away with over the years (suffixes will be analysed in a separate post), with Lakshmamma turning into Lakshmi, for example.

It is also interesting to note that for a long period of time (women currently aged 30-43), Lakshmi went out of fashion, with Manjula taking over as the most common name! And then the trend reversed, and the most common name among 24-29 year old women is Lakshmi again! And that, too, seems to have gone out of fashion, with "modern names" such as Divya, Kavya and Pooja taking over! Check out these graphs to see the trends.

(I’ve assumed Manjunath and Manjunatha are the same for this graph)
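The graphs themselves are simple share-by-age lines – roughly this (a sketch using matplotlib, with the same assumed columns as before):

```python
import matplotlib.pyplot as plt

# Share of women of each age bearing a given name, for a few names of interest.
# Assumes the 'voters' dataframe with 'given_name', 'gender' and 'age' columns.
women = voters[voters["gender"] == "F"]
for name in ["Lakshmi", "Manjula", "Divya"]:
    share = women.groupby("age")["given_name"].apply(
        lambda s, n=name: (s == n).mean() * 100
    )
    share.plot(label=name)

plt.xlabel("age")
plt.ylabel("share of women with the name (%)")
plt.legend()
plt.show()
```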

So what explains Manjunath and Manjula being so incredibly popular in a certain age range, but quickly falling away on both sides? Maybe there was a lot of fog (manju) over Bangalore for a few years? 😛

High dimension and low dimension data science

I’ve observed that there are two broad approaches that people take to getting information out of data. One approach is to simply throw a kitchen sink full of analytical techniques at the data. Without really trying to understand what the data looks like, and what the relationships may be, the analyst simply uses one method after another to try and get insight from the data. Along the way, a “model” will get built.

The other approach (which I'm partial to) involves understanding each variable, and the relationships between variables, as a first step to getting insight from the data. Here, too, a model might get built, but it will be conditional on the analyst's view of what kind of model might suit the data after looking at it.

Considering that both these approaches are used by large numbers of analysts, it is highly likely that both are legitimate. Then what explains the fact that some analysts use one approach, and others use another? Having thought about it for a long time, I have a hypothesis – it depends on the kind of data being analysed. More precisely, it has to do with the dimensionality of the data.

The first approach (which one might classify as "machine learning") works well when the data is of high dimensions – where the number of variables that can be used as predictors is really large, of the order of thousands or larger. For example, even a seemingly low-resolution 32 by 32 pixel image, looked at as a data point, has 1024 dimensions (the colour of the image at each of the 1024 pixels is a different dimension).
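To make the pixel example concrete (a tiny sketch with random pixel values standing in for a real greyscale image):

```python
import numpy as np

# A 32x32 greyscale image, flattened into the vector a model actually sees:
# each of the 32 * 32 = 1024 pixels is its own dimension.
image = np.random.rand(32, 32)   # random pixels standing in for a real image
flattened = image.reshape(-1)
print(flattened.shape)           # (1024,)
```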

Moreover, in such situations, it is likely that the signal in the data doesn't come from one, two, or a handful of predictors. In high dimension data science, the signal usually comes from a complex interplay of data along various dimensions. And this kind of search is not something humans are well suited for – it is best to leave the machines to "learn" the model by themselves, and so you get machine learning.

On the other hand, when the dimensionality of the dataset is low, it is possible (and "easy") for an analyst to look at the interplay of factors in detail, and understand the data before going on to build the model. Doing so can help the analyst identify patterns in the data that may not be that apparent to a machine, and it is also likely that in such datasets the signal lies along a small number of dimensions, where relatively simple manipulation will suffice. The low dimensionality also means that complex machine learning techniques are unlikely to contribute much in such cases.

As you might expect, from an organisational perspective, the solution is quite simple – to deploy high-dimension data scientists on high-dimension problems, and likewise with low-dimension data scientists. Since this distinction between high-dimension and low-dimension data scientists isn’t very well known, it’s quite possible that the scientists might be made to work on a problem of dimensionality that is outside of their comfort zone.

So when you have low dimensional data scientists faced with a large number of dimensions of data, you will see them use brute force to try and find signals in bivariate relationships in the data – an approach that will never work since the signal lies in a more complex interplay of dimensions.

On the other hand, when you put high dimension data scientists on a low dimension problem, you will either see them missing out on associations that a human could easily spot but a machine might find hard to detect, or unnecessarily "reducing the problem to a known problem" by generating and importing large amounts of data in order to turn it into a high dimension problem!

PS: I wanted to tweet this today but forgot. Basically, you use logistic regression when you think the signal is an "or" of conditions on some of the underlying variables. On the other hand, if you think the signal is more likely to be an "and" of certain conditions, then you should use decision trees!
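This is of course a heuristic rather than a theorem, but it's easy to poke at on synthetic data – a rough sketch using scikit-learn, with arbitrary thresholds and sample sizes, generating an "and"-structured and an "or"-structured target and fitting both models to each:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((5000, 2))

# "and" signal: positive only when both conditions hold.
y_and = (X[:, 0] > 0.5) & (X[:, 1] > 0.5)
# "or" signal: positive when either condition holds.
y_or = (X[:, 0] > 0.5) | (X[:, 1] > 0.5)

for label, y in [("and", y_and), ("or", y_or)]:
    logit = LogisticRegression().fit(X, y)
    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(label, "logistic:", round(logit.score(X, y), 3),
          "tree:", round(tree.score(X, y), 3))
```

How well each model copes with each structure is left for you to check; the point is only that the two targets have genuinely different logical shapes.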