## Correlation and causation

So I have this lecture on “smelling (statistical) bullshit” that I’ve delivered in several places, which I inevitably start with a lesson on how correlation doesn’t imply causation. I give a large number of examples of people mistaking correlation for causation, the class makes fun of everything that doesn’t apply to them, then everyone sees this wonderful XKCD cartoon and then we move on.

One of my favourite examples of correlation-causation (which I don’t normally include in my slides) has to do with religion. Praying before an exam in which one did well doesn’t necessarily imply that the prayer resulted in the good performance in the exam, I explain. So far, there has been no outward outrage at my lectures, but this does visibly make people uncomfortable.

Going off on a tangent, the time in life when I realised that I’m not religious was when I pondered over the correlation-causation issue some six or seven years back. Until then I’d had this irrational need to draw a relationship between seemingly unrelated things that had happened together once or twice, and that had given me a lot of mental stress. Looking at things from a correlation-causation perspective, however, helped clear up my mind on those things, and also made me believe that most religious activity is pointless. This was a time in life when I got immense mental peace.

Yet, for most of the world, it is not freedom from religion but religion itself that gives them mental peace. People do absurd activities only because they think these activities lead to other good things happening, thanks to a small number of occasions when these things have coincided, either in their own lives or in the lives of their ancestors or gurus.

In one of my lectures a few years back I had remarked that one reason why humans still mistake correlation for causation is religion – for if correlation did not imply causation then most religious rituals would be rendered meaningless and that would render people’s lives meaningless. Based on what I observed today, however, I think I’ve got this causality wrong.

It’s not because of religion that people mistake correlation for causation. Instead, we’ve evolved to recognise patterns whenever we observe them, and a side effect of that is that we immediately assume causation whenever we see things happening together. Religion is just a special case of application of this correlation-causation second nature to things in real life.

So my daughter (who is two and a half) and I were standing in our balcony this evening, observing that it had rained heavily last night. Heavy rain reminded my daughter of this time when we had visited a particular aunt last week – she clearly remembered watching the heavy rain from this aunt’s window. Perhaps none of our other visits to this aunt’s house really registered in the daughter’s imagination (it’s barely two months since we returned to Bangalore, so admittedly there aren’t that many data points), so this aunt’s house is inextricably linked in her mind to rain.

And this evening, because she wanted it to rain heavily again, the daughter suggested that we go visit this aunt once again. “We’ll go to Inna Ajji’s house and then it will start raining”, she kept saying. “Yes, it rained the last time we went there, but it was random. It wasn’t because we went there”, I kept saying. It wasn’t easy to explain it.

You know when you are about to have a kid you develop visions of how you’ll bring her up, and what you’ll teach her, and what she’ll say to “jack” the world. Back then I’d decided that I’d teach my yet-unborn daughter that “correlation does not imply causation” and she could use it against “elders” who were telling her absurd stuff.

I hadn’t imagined that mistaking correlation for causation is so fundamental to human nature that it would be a fairly difficult task to actually teach my daughter that correlation does not imply causation! Hopefully in the next one year I can convince her.

## Biases, statistics and luck

Tomorrow Liverpool plays Manchester City in the Premier League. As things stand now I don’t plan to watch this game. This entire season so far, I’ve only watched two games. First, I’d gone to a local pub to watch Liverpool’s visit to Manchester City, back in September. Liverpool got thrashed 5-0.

Then in October, I went to Wembley to watch Tottenham Hotspur play Liverpool. The Spurs won 4-1. These two remain Liverpool’s only defeats of the season.

I might consider myself to be a mostly rational person but I sometimes do fall for the correlation-implies-causation bias, and think that my watching those games had something to do with Liverpool’s losses in them. Never mind that these were away games played against other top sides which attack aggressively. And so I have this irrational “fear” that if I watch tomorrow’s game (even if it’s from a pub), it might lead to a heavy Liverpool defeat.

And so I told Baada, a Manchester City fan, that I’m not planning to watch tomorrow’s game. And he got back to me with some statistics, which he’d heard from a podcast. Apparently it’s been 80 years since Manchester City did the league “double” (winning both home and away games) over Liverpool. And that it’s been 15 years since they’ve won at Anfield. So, he suggested, there’s a good chance that tomorrow’s game won’t result in a mauling for Liverpool, even if I were to watch it.

With the easy availability of statistics, it has become a thing among football commentators to supply them during the commentary. And on first hearing, things like “never done this in 80 years” or “never done that for last 15 years” sound compelling, and you’re inclined to believe that there is something to these numbers.

I don’t remember if it was Navjot Sidhu who said that statistics are like a bikini (“what they reveal is significant but what they hide is crucial” or something). That Manchester City hasn’t done a double over Liverpool in 80 years doesn’t mean a thing, nor does it say anything that they haven’t won at Anfield in 15 years.

Basically, until the mid 2000s, City were a middling team. I remember telling Baada after the 2007 season (when Stuart Pearce got fired as City manager) that they’d surely be relegated next season. And then came the investment from Thaksin Shinawatra. And the appointment of Sven-Goran Eriksson as manager. And then the YouTube signings. And later the investment from the Abu Dhabi investment group. And in 2016 the appointment of Pep Guardiola as manager. And the significant investment in players after that.

In other words, Manchester City of today is a completely different team from what they were even 2-3 years back. And they’re surely a vastly improved team compared to a decade ago. I know Baada has been following them for over 15 years now, but they’re unrecognisable from the time he started following them!

Yes, even with City being a much improved team, Liverpool have never lost to them at home in the last few years – but then Liverpool have generally been a strong team playing at home in these years! On the other hand, City’s 18-game winning streak (which included wins at Chelsea and Manchester United) only came to an end (with a draw against Crystal Palace) rather recently.

So anyways, here are the takeaways:

1. Whether I watch the game or not has no bearing on how well Liverpool will play. The instances from this season so far are based on (a) small samples and (b) biased samples (since I’ve chosen to watch Liverpool’s two toughest games of the season).
2. The 80-year history of a fixture has no bearing, since the teams have evolved significantly in these 80 years. So saying a record has stood for so long has no meaning or predictive power for tomorrow’s game.
3. City have been in tremendous form this season, and Liverpool have just lost their key player (by selling Philippe Coutinho to Barcelona), so City can fancy their chances. That said, Anfield has been a fortress this season, so Liverpool might just hold (or even win it).
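The first takeaway can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the function name `simulate_seasons`, the 38-game season and the range of per-game loss probabilities are all hypothetical and not taken from real results. In this toy model, watching a game has no effect whatsoever on the result; the fan simply picks the toughest fixtures to watch.

```python
import random


def simulate_seasons(n_seasons=5000, n_games=38, n_watched=2, seed=0):
    """Return (loss rate in watched games, loss rate in all games).

    Watching never changes a result here; the only thing the fan does
    is choose the fixtures with the highest loss probability.
    """
    rng = random.Random(seed)
    watched_losses = watched_total = 0
    all_losses = all_total = 0
    for _ in range(n_seasons):
        # Hypothetical chance of losing each game, between 5% and 50%.
        loss_probs = [rng.uniform(0.05, 0.5) for _ in range(n_games)]
        lost = [rng.random() < p for p in loss_probs]
        # Biased sample: the fan watches only the toughest fixtures.
        toughest = sorted(range(n_games), key=lambda i: loss_probs[i],
                          reverse=True)[:n_watched]
        watched_losses += sum(lost[i] for i in toughest)
        watched_total += n_watched
        all_losses += sum(lost)
        all_total += n_games
    return watched_losses / watched_total, all_losses / all_total


watched_rate, overall_rate = simulate_seasons()
print(f"loss rate in watched games: {watched_rate:.2f}")
print(f"loss rate in all games:     {overall_rate:.2f}")
```

Even though watching does nothing in this model, the watched games show a much higher loss rate than the season as a whole, purely because of how they were chosen.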

All of this points to a good game tomorrow! Maybe I should just watch it!

## The problem with precedence

One common bureaucratic practice across bureaucracies and across countries is that of “precedence”. If a certain action has “precedence” and the results of that preceding action have been broadly good, that action immediately becomes kosher. However, from the point of view of logical consistency, there are several problems with this procedure.

The first issue is that of small samples – if there is a small number of times a certain action has been tried in the past, the degree of randomness associated with the result of that action is significant. Thus, relying on the result of a handful of instances of prior action is not likely to be reliable.
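To put rough numbers on the small-sample problem, here is a minimal sketch. It assumes, purely for illustration, that the action makes no difference at all, so that each past instance turns out "good" with probability 0.5 independently of the others. The helper name `prob_at_least` is mine.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Probability of at least k good results in n independent tries,
    when each try is good with probability p (binomial distribution)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# An action that makes no difference still has a spotless record
# after 3 tries one time in eight.
print(prob_at_least(3, 3))   # 0.125
# At least 4 good results out of 5, by luck alone.
print(prob_at_least(4, 5))   # 0.1875
```

With a handful of precedents, a "broadly good" track record turns up by chance often enough that it says very little about the action itself.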

The second, and related, issue is that of correlation and causation. That the particular action in the past was associated with a particular result doesn’t necessarily mean that the result, whether good or bad, was a consequence of the action. The question we need to ask in this case is whether the result was because of or in spite of the action. It is well possible that a lousy policy in the past led to good results thanks to a favourable market environment. It is also equally possible that a fantastic policy led to lousy results because of a lousy environment.

Thus, when we evaluate precedence, we should evaluate the process and methodology, rather than result. We should accept that the action alone can never fully explain the result of the action, and thus evaluate the action in light of the prevailing conditions, etc. rather than just by the result.
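The "because of or in spite of" question can be made concrete with a toy simulation. Everything here is an assumption made for illustration: policy quality and the environment are independent coin flips, the environment counts three times as much as the policy towards the result, and there is some random noise on top. The function name `good_outcome_rates` is hypothetical.

```python
import random


def good_outcome_rates(n_trials=100_000, seed=1):
    """Return (share of good results under good policies,
               share of good results under lousy policies)."""
    rng = random.Random(seed)
    good = {True: 0, False: 0}
    total = {True: 0, False: 0}
    for _ in range(n_trials):
        policy_is_good = rng.random() < 0.5
        env_is_good = rng.random() < 0.5
        # Assumed weights: the environment matters 3x as much as the policy.
        score = ((1 if policy_is_good else -1)
                 + (3 if env_is_good else -3)
                 + rng.gauss(0, 1))
        total[policy_is_good] += 1
        good[policy_is_good] += score > 0
    return good[True] / total[True], good[False] / total[False]


good_policy_rate, lousy_policy_rate = good_outcome_rates()
print(f"good results under good policies:  {good_policy_rate:.2f}")
print(f"good results under lousy policies: {lousy_policy_rate:.2f}")
```

Under these assumptions, lousy policies end up with good results almost as often as good policies do, so judging an action by its result alone tells you next to nothing about the action.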

It is going to take significant bureaucratic rethinking to accept this, but unless this happens it is unlikely that a bureaucracy can function effectively.

## Commute Distance and Prosperity

There is an interesting report on The Hindu Blogs about commute distance and prosperity. Referring to a World Bank report in 2005, the blog post talks about richer people commuting longer distances to work. Rukmini S, who has written the piece, also finds from the latest NSSO data that richer states in India have a higher proportion of people commuting more than 5 km to work.

I didn’t like the visualization (or the lack of it) in Rukmini’s article, and hence this post. I thought the point about long commutes to work and richer states would be better made in a scatter plot, and that is what I produce here:

On the X axis is the proportion of the urban population in each state that commutes over 5 km to work each way. The data is from the latest NSSO Survey (pages 28-29). On the Y axis I have a measure of the level of economic activity in a state – the per capita Gross State Domestic Product. The advantage of this measure is that it takes out from the equation the size of the state itself, and instead focuses on the level of economic activity per person. The figures are from 2011-12 and the numbers are based on 2004-05 prices. The data is from the RBI website.

The correlation is clear – barring a few small states, the above plot clearly shows that the greater the proportion of people who commute long distances to work, the greater the economic activity in that state. The question, however, is whether there is a causal effect and if so, in which direction – does people travelling longer distances cause greater economic activity, or does greater economic activity lead to people commuting longer distances?
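One reason the scatter plot cannot settle the direction is that the correlation coefficient is symmetric in its two variables. The sketch below uses synthetic numbers only (not the NSSO or RBI data) and builds two hypothetical worlds, one where commuting drives economic activity and one where it is the other way round. The helper names `pearson` and `make_world` are mine.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def make_world(commute_drives_gdp, n=5000, seed=0):
    """Return synthetic (commute, gdp) lists with a known causal direction."""
    rng = random.Random(seed)
    cause = [rng.gauss(0, 1) for _ in range(n)]
    effect = [c + rng.gauss(0, 1) for c in cause]
    return (cause, effect) if commute_drives_gdp else (effect, cause)


commute_a, gdp_a = make_world(commute_drives_gdp=True, seed=2)
commute_b, gdp_b = make_world(commute_drives_gdp=False, seed=3)

print(f"commute drives GDP: r = {pearson(commute_a, gdp_a):.2f}")
print(f"GDP drives commute: r = {pearson(commute_b, gdp_b):.2f}")
```

The two worlds produce practically the same correlation, so the number by itself gives no hint about which way the causation runs.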

The World Bank paper proposes that the more well-to-do commute longer distances only because the cost of local transport in Mumbai is high and the poor cannot afford that. This is a view that Rukmini endorses in her piece in The Hindu. The argument doesn’t particularly make sense, though. Do the World Bank researchers intend to say that transport costs outstrip housing costs in prime areas in Mumbai? If so, it is extremely hard to believe.

At the state level, one possible reason why people in richer states travel more is because greater economic activity happens in bigger urban agglomerations. The economic activity of a town or village is a super-linear function of the number of people living there. And when you have larger urban agglomerations, people tend to live farther from their workplaces, and thus commute more.

Again – this is a chicken and egg problem – a level of economic activity in a town or village leads to an increase in population, which results in longer commutes. Increase in population leads to even greater economic activity, and this sets off a virtuous cycle. The 20-fold increase in Bangalore’s population in the last 70 years can be attributed to this cycle, and it is hard to put a direction of causation to it.

The above explanation, however, doesn’t explain the following graph. This graph is identical to the one above except that here we look at the proportion of rural residents who commute over 5 km to work. And this is again positively correlated with economic activity!

What can possibly explain this? One way to explain this is that when people stay close to a town or city with high economic activity, they might prefer to participate in that rather than working in the village itself, and thus they might be commuting longer distances. States with high economic activity are likely to have a larger number of villages close to urban/semi-urban centres of high economic activity, and thus people are likely to travel longer distances.

When more people are willing to travel longer distances for work, it leads to people coming together to work at a higher rate than normally happens in a village, and this leads to higher economic activity! Again, it is hard to put a directionality to the causation!