Surveying Income

For a long time now, I’ve been sceptical of the practice of finding out the average income in a country or state or city or locality by doing a random survey. The argument I’ve made is “whether you keep Mukesh Ambani in the sample or not makes a huge difference in your estimate”. So far, though, I hadn’t been able to make a proper mathematical argument.

In the course of writing a piece for Bloomberg Quint (my first for that publication), I figured out a precise mathematical argument. Basically, incomes are distributed according to a power law, and the exponent of that power law is such that the variance is not defined. And hence the Central Limit Theorem isn’t applicable.

OK let me explain that in English. The reason sample surveys work is due to a result known as the Central Limit Theorem. This states that for a distribution with finite mean and variance, the average of a random sample of data points is not very far from the average of the population, and the difference follows a normal distribution with zero mean and variance that is inversely proportional to the number of points surveyed.

So if you want to find out the average height of the population of adults in an area, you can simply take a random sample, find out their heights and you can estimate the distribution of the average height of people in that area. It is similar with voting intention – as long as the sample of people you survey is random (and without bias), the average of their voting intention can tell you with high confidence the voting intention of the population.

This, however, doesn’t work for income. Based on data from the Indian Income Tax department, I could confirm (what theory states) that income in India follows a power law distribution. As I wrote in my piece:

The basic feature of a power law distribution is that it is self-similar – where a part of the distribution looks like the entire distribution.

Based on the income tax returns data, the number of taxpayers earning more than Rs 50 lakh is 40 times the number of taxpayers earning over Rs 5 crore.
The ratio of the number of people earning more than Rs 1 crore to the number of people earning over Rs 10 crore is 38.
About 36 times as many people earn more than Rs 5 crore as do people earning more than Rs 50 crore.

In other words, if you increase the income limit by a factor of 10, the number of people who earn over that limit falls by a factor between 35 and 40. This translates to a power law exponent between 1.55 and 1.6 (log 35 to base 10 and log 40 to base 10 respectively).
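To spell out the arithmetic: if the number of people earning more than x is proportional to x^{-\alpha}, then raising the limit tenfold divides that count by 10^{\alpha}. Setting 10^{\alpha} equal to the observed factor of 35 to 40 gives \alpha between \log_{10} 35 \approx 1.55 and \log_{10} 40 \approx 1.6.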

Now power laws have a quirk – their mean and variance are not always defined. If the exponent of the power law is less than 1, the mean is not defined. If the exponent is less than 2, then the distribution doesn’t have a defined variance. So in this case, with an exponent around 1.6, the distribution of income in India has a well-defined mean but no well-defined variance.

To recall, the central limit theorem states that the sample mean follows a normal distribution centred at the population mean, with a variance of \frac{\sigma^2}{n}, where \sigma is the standard deviation of the underlying distribution. And when the underlying distribution itself is a power law with an exponent less than 2 (as is the case in India), \sigma itself is not defined.

Which means the distribution of the sample mean around the population mean has infinite variance. Which means the sample mean tells you absolutely nothing!
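A quick simulation (not from the Bloomberg Quint piece) makes the point concrete. The sketch below assumes incomes follow a Pareto distribution with exponent 1.6 (and minimum value 1, which is just illustrative), and compares the spread of survey averages for income against that for a finite-variance quantity like height:

```r
# Minimal sketch: why survey averages work for height but not for power-law incomes.
set.seed(42)
alpha     <- 1.6    # power law exponent, as estimated from the income tax data
n         <- 10000  # people per survey
n_surveys <- 1000   # number of independent surveys

# Inverse-transform sampling from a Pareto distribution with minimum value 1
pareto_sample <- function(n) runif(n)^(-1 / alpha)

income_means <- replicate(n_surveys, mean(pareto_sample(n)))
height_means <- replicate(n_surveys, mean(rnorm(n, mean = 165, sd = 10)))

# Height survey means are tightly clustered around 165; income survey means
# jump around wildly, with the occasional huge value driven by one "Ambani" in the sample.
summary(height_means)
summary(income_means)
```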

And hence, surveying is not a good way to find the average income of a population.

Vlogging!

The first seed was sown in my head by Harish “the Psycho” J, who told me a few months back that nobody reads blogs any more, and I should start making “analytics videos” to increase my reach and hopefully hit a new kind of audience with my work.

While the idea was great, I wasn’t sure for a long time what videos I could make. After all, I’m not the most technical guy around, and I had no patience for making videos on “how to use regression” and stuff like that. I needed a topic that would be both potentially catchy and something where I could add value. So the idea remained an idea.

For the last four or five years, my most common lunchtime activity has been to watch chess videos. I subscribe to the YouTube channels of Daniel King and Agadmator, and most days when I eat lunch alone at home are spent watching their analyses of games. Usually this routine gets disrupted on Fridays when the wife works from home (she positively hates these videos), but one Friday a couple of months back I decided to ignore her anyway and watch the videos (she was in her room working).

She had come out to serve herself another helping of whatever she had made that day, and saw me watching the videos. She suddenly asked why I couldn’t make such videos as well. She has seen me work over the last seven years to build what I think is a fairly cool cricket visualisation, and said that I should use it to make little videos analysing cricket matches.

And since then my constant “background process” has been to prepare for these videos. Earlier, Stephen Rushe of Cricsheet used to unfailingly upload ball by ball data of all cricket matches as soon as they were done. However, two years back he went into “maintenance mode” and has stopped updating the data. And so I needed a method to get data as well.

Here, I must acknowledge the contributions of Joe Harris of White Ball Analytics, who not only showed me the APIs to get ball by ball data of cricket matches, but also gave very helpful inputs on how to make the visualisation more intuitive, and palatable to the normal cricket fan who hasn’t seen such a thing before. Joe has his own win probability model based on ball by ball data, which I think is possibly superior to mine in a lot of scenarios (my model does badly in high-scoring run chases), though I’ve continued to use my own model.

So finally the data is ready, I have a much improved visualisation compared to what I had during the IPL last year, and I’ve created what I think is a nice app using the Shiny package that you can check out for yourself here. This covers all T20 international games, and you can use the app to see the “story of each game”.

And this is where the vlogging comes in – in order to explain how the model works and how to use it, I’ve created a short video. You can watch it here:

While I still have a long way to go in terms of my delivery, you can see that the video has come out rather well. There are no sync issues, and you can see my face in one corner as well. This was possible due to my school friend Sunil Kowlgi’s Outklip app. It’s a pretty easy-to-use Chrome app, and the videos are immediately available on the platform. There is quick YouTube integration as well, for you to upload them.

And this is not a one-time effort – going forward I’ll be making videos of limited overs games, analysing them using my app, and posting them on my YouTube channel (or maybe I’ll make a new channel for these videos. I’ll keep you updated). I hope to become a regular vlogger!

So in the meantime, watch the above video. And give my app a spin. Soon I’ll be releasing versions covering One Day Internationals and franchise T20s as well.


Premier League Points Efficiency

It would be tautological to say that you win in football by scoring more goals than your opponent. What is interesting is that scoring more goals and letting in fewer works across games in a season as well, as data from the English Premier League shows.

We had seen an inkling of this last year, when I showed that points in the Premier League were highly correlated with goal difference (96% R squared, for those that are interested). A little past the midway point of the current season, the correlation holds – 96% again.

In other words, a team’s goal difference (number of goals scored minus goals let in) can explain 96% of the variance in the number of points gained by the team in the season so far. The point of this post is to focus on the rest.

In the above image, the blue line is the line of best fit (or regression line). This line predicts the number of points scored by a team given their goal difference. Teams located above this line have been more efficient or lucky – they have got more points than their goal difference would suggest. Teams below this line have been less efficient or unlucky – their goal difference has been distributed badly across games, leading to fewer points than the team should have got.
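If you want to replicate this, here is a minimal sketch in R. The file name and column names (team, points, goal_diff) are placeholders for however you have stored the league table, not necessarily what I used:

```r
# Hypothetical input: one row per team, with points and goal difference so far this season
pl <- read.csv("premier_league_table.csv")   # placeholder file name

# Regress points on goal difference
fit <- lm(points ~ goal_diff, data = pl)
summary(fit)$r.squared   # around 0.96 for the seasons discussed above

# Residuals: positive = more points than the goal difference suggests (efficient/lucky),
# negative = fewer points than the goal difference suggests (inefficient/unlucky)
pl$efficiency <- residuals(fit)
pl[order(-pl$efficiency), c("team", "points", "goal_diff", "efficiency")]
```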

Manchester City seem to be extremely unlucky this season, in that they have got about five fewer points than their goal difference suggests. The other teams close to the top of the league are all above the line – showing they’ve been more efficient in the way their goals have been distributed (Spurs and Arsenal have been luckier than ManYoo, Chelski and Liverpool).

At the other end of the table, Huddersfield Town have been unlucky – their goal difference suggests they should have had four more points – a big difference for a relegation threatened team. Southampton, Newcastle and Crystal Palace are also in the same boat.

Finally, the use of goal difference to break ties in league tables is an attempt to undo the luck (or lack of it) that results in teams under- or over-performing, in terms of points, relative to the number of goals they’ve scored and let in. Some teams would have got many more (or fewer) points than they deserved, by sheer dint of their goals having been distributed better (or worse) across matches (big losses and narrow wins). Tie-breaking on goal difference is a small attempt to set that right.

Built by Shanks

This morning, I found this tweet by John Burn-Murdoch, a statistician at the Financial Times, about a graphic he had made for a Simon Kuper (of Soccernomics fame) piece on Jose Mourinho.

Burn-Murdoch also helpfully shared the code he had written to produce this graphic, through which I discovered ClubElo, a website that produces chess-style Elo ratings for football clubs. They have a free and open API, through which Burn-Murdoch got the data for the above graphic, and which I used to download all-time Elo ratings for all clubs available (I can be greedy that way).

So the first order of business was to see how Liverpool’s rating has moved over time. The initial graph looked interesting, but not very interesting, so I decided to overlay it with periods of managerial regimes (the latter data I got through Wikipedia). And this is what the all-time Elo rating of Liverpool looks like.
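For anyone who wants to pull the same data, here is a rough sketch in R. It assumes ClubElo’s per-club CSV endpoint (api.clubelo.com/<ClubName>), which returns a rating history including From, To and Elo columns – do check the API documentation on the site before relying on that:

```r
library(ggplot2)

# Download Liverpool's full Elo history from ClubElo (assumed CSV endpoint)
liv <- read.csv("http://api.clubelo.com/Liverpool", stringsAsFactors = FALSE)
liv$From <- as.Date(liv$From)

# Plot the all-time rating; managerial reigns (from Wikipedia) can be overlaid with
# geom_rect() or geom_vline() once you have each manager's start and end dates
ggplot(liv, aes(x = From, y = Elo)) +
  geom_line() +
  labs(title = "Liverpool – all-time ClubElo rating", x = "", y = "Elo rating")
```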

It is easy to see that the biggest improvement in the club’s performance came under the long reign of Bill Shankly (no surprises there), who took them from the Second Division to winning the old First Division. There was a brief dip when Shankly retired and his assistant Bob Paisley took over (might this be the time when Paisley got intimidated by Shankly’s frequent visits to the club, and then asked him not to come any more?), but Paisley consolidated on Shankly’s improvement to lead the club to its first three European Cups.

Around 2010, when the club was owned by Americans Tom Hicks and George Gillett and on a decline in terms of performance, this banner became popular at Anfield.

The Yanks were subsequently yanked following a protracted court battle, to be replaced by another Yank (John W Henry), under whose ownership the club has done much better. What is also interesting from the above graph is the timing of the managerial changes.

At the time, Kenny Dalglish’s sacking at the end of the 2011-12 season (which ended with Liverpool losing the FA Cup final to Chelsea) seemed unfair, but the Elo rating shows that the club’s rating had fallen below the level at which Dalglish had taken over (initially as caretaker). Then there was a steep ascent under Brendan Rodgers (leading to second place in 2013-14), after which Suarez bit an opponent and got sold, and the team went into deep decline.

Again, we can see that Rodgers got sacked when the team had reverted to the rating that he had started off with. That’s when Jurgen Klopp came in, and thankfully so far there has been a much longer period of ascendance (which will hopefully continue). It is interesting to see, though, that the club’s current rating is still nowhere near the peak reached under Rafa Benitez (in the 2008-9 title challenge).

Impressed by the story that Elo Ratings could tell, I got data on all Premier League managers, and decided to repeat the analysis for all clubs. Here is what the analysis for the so-called “top 6” clubs returns:

We see, for example, that Chelsea’s ascendancy started not with Mourinho’s first term as manager, but towards the end of Ranieri’s term – when Roman Abramovich had made his investment. We find that Jose Mourinho actually made up for the decline under David Moyes and Louis van Gaal, and then started losing it. In that sense, Manchester United have got their sacking timing right (though they were already in decline by the time they finished last season in second place).

Manchester City also seem to have done pretty well in terms of the timing of managerial changes. And Spurs’s belief in Mauricio Pochettino, who started off badly, seems to have paid off.

I wonder why Elo Ratings haven’t made more impact in sports other than chess!

Bangalore names are getting shorter

The Bangalore Names Dataset, derived from the Bangalore Voter Rolls (cleaned version here), validates a hypothesis that a lot of people had – that given names in Bangalore are becoming shorter. From an average of 9 letters for a male aged around 80, the average name length comes down to 6.5 letters for a 20-year-old male.

What is interesting from the graph (click through for a larger version) is the difference in lengths of male and female names – notice the crossover around age 25 or so. At some point, men’s names continue to become shorter while the lengths of women’s names stagnate.
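The underlying computation is a simple aggregation. Here is a minimal sketch in R, assuming the cleaned voter rolls are in a data frame called voters – the column names (given_name, age, sex) are placeholders, not necessarily what the dataset uses:

```r
library(dplyr)
library(ggplot2)

# Average given-name length by age and sex
name_lengths <- voters %>%
  mutate(name_length = nchar(given_name)) %>%
  group_by(age, sex) %>%
  summarise(avg_length = mean(name_length), .groups = "drop")

# One line per sex, name length on the Y axis, age on the X axis
ggplot(name_lengths, aes(x = age, y = avg_length, colour = sex)) +
  geom_line() +
  labs(x = "Age", y = "Average length of given name")
```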

So how are names becoming shorter? For one, honorific endings such as -appa, -amma, -anna, -aiah and -akka are becoming less and less common. Someone named “Krishnappa” (the most common name with the ‘appa’ suffix) in Bangalore is on average 56 years old, while someone named Krishna (the same name without the suffix) is on average only 44 years old. Similarly, the average age of people named Lakshmamma is 55, while that of everyone named Lakshmi is just 40.

In fact, if we look at the top 12 male and female names with an honorific ending, the average age of the version without the ending is lower than that of the version with the ending. I’ve even graphed some of the distributions to illustrate this.

In each case, the red line shows the distribution of the longer version of the name, and the blue line the distribution of the shorter version.

In one of the posts yesterday, we looked at the most typical names by age in Bangalore. What happens when we flip the question – can we identify the “oldest” and “youngest” names, based on the average age of the people who hold them? In order to rule out fads, let’s stick to names that are held by at least 10000 people each.
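As with the name-length analysis, this is a straightforward aggregation. A sketch, again assuming a voters data frame with placeholder columns given_name and age:

```r
library(dplyr)

# Average age of voters holding each given name, restricted to popular names
name_ages <- voters %>%
  group_by(given_name) %>%
  summarise(n_voters = n(), avg_age = mean(age), .groups = "drop") %>%
  filter(n_voters >= 10000)   # rule out fads: at least 10000 voters with the name

# "Youngest" and "oldest" names by average age
name_ages %>% arrange(avg_age) %>% head(10)
name_ages %>% arrange(desc(avg_age)) %>% head(10)
```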

These graphs are candidates for my own Bad Visualisations Tumblr, but I couldn’t think of a better way to represent the data. These graphs show the most popular male and female names, with the average age of a voter with that given name on the X axis, and the number of voters with that name on the Y axis. The information is all in the X axis – the Y axis is there just so that names don’t overlap.

So Karthik is among the youngest names among men, with an average age among voters of about 28 (remember this is not the average age of all Karthiks in Bangalore – those aged below 18 or otherwise not eligible to vote have been excluded). On the women’s side, Divya, Pavithra and Ramya are among the “youngest names”.

At the other end, you can see all the -appas and -ammas. The “oldest male name” is Krishnappa, with an average age of 56. And then you have Krishnamurthy and Narayana, which don’t have the -appa suffix but represent an old population anyway (the other -appa names just don’t clear the 10000 people cutoff).

More women’s names with the -amma suffix clear the 10000 names cutoff, and we can see that pretty much all women’s names with an average age of 50 and above have that suffix. And the “oldest female name”, subject to 10000 people having that name, is Muniyamma. And then you have Sarojamma and Jayamma and Lakshmamma. And a lot of other ammas.

What will the oldest and youngest names be if we relax the popularity cutoff, and instead look at names with at least 1000 people? The five youngest names are Dhanush, Prajwal, Harshitha, Tejas and Rakshitha, all with an average age (among voters) of less than 24. The five oldest names are Papamma, Kannamma, Munivenkatappa, Seethamma and Ramaiah.

This should give another indication of where names are headed in Bangalore!

Single Malt Recommendation App

Life is too short to drink whisky you don’t like.

How often have you found yourself in a duty free shop in an airport, wondering which whisky to take back home? Unless you are a pro at this already, you might want something you haven’t tried before, but don’t want to end up buying something you may not like. The names are all grand, as Scottish names usually are. The region might offer some clue, but not so much.

So I started on this work a few years back, when I first discovered this whisky database. I had come up with a set of tables to recommend what whisky is similar to what, and which single malts are the “most unique”. Based on this, I discovered that I might like Ardbeg. And I ended up absolutely loving it.

And ever since, I’ve carried a couple of tables in my Evernote to make sure I have some recommendations handy when I’m at a whisky shop and need to make a decision. But the tables are not user friendly, and don’t typically tell you what you should buy, what your next choice should be, and so on.

To make things more user-friendly, I have built this app where all you need to enter is your favourite set of single malts, and it gives you a list of other single malts that you might like.

The data set is the same. I once again use cosine similarity to find the similarity of different whiskies. Except that this time I take the average of your favourite whiskies, and then look for the whiskies that are closest to that.
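Here is a minimal sketch of that logic in R (not the exact app code). It assumes the whisky data sits in a matrix called whisky_features, with one row per single malt (row names being the distillery names) and one column per flavour attribute:

```r
# Cosine similarity between two flavour-profile vectors
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

recommend <- function(favourites, features, top_n = 10) {
  # Average the flavour profiles of the whiskies you already like...
  profile <- colMeans(features[favourites, , drop = FALSE])
  # ...then rank every other whisky by how close its profile is to that average
  sims <- apply(features, 1, cosine_sim, b = profile)
  sims <- sims[!(names(sims) %in% favourites)]
  head(sort(sims, decreasing = TRUE), top_n)
}

# Example call (hypothetical favourites):
# recommend(c("Ardbeg", "Lagavulin"), whisky_features)
```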

In terms of technologies, I’ve used this R package called Shiny to build the app. It took not more than half an hour of programming effort to build, and most of that was in actually building the logic, not the UI stuff.
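For a sense of how little Shiny code such an app needs, here is a barebones skeleton (illustrative only, not the actual app; it reuses the whisky_features matrix and the recommend() function sketched above):

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Single Malt Recommendations"),
  selectInput("favourites", "Pick the single malts you already like:",
              choices = rownames(whisky_features), multiple = TRUE),
  tableOutput("recommendations")
)

server <- function(input, output) {
  output$recommendations <- renderTable({
    req(input$favourites)   # wait until at least one malt has been picked
    recs <- recommend(input$favourites, whisky_features)
    data.frame(whisky = names(recs), similarity = round(recs, 3))
  })
}

shinyApp(ui, server)
```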

So take it for a spin, and let me know what you think.


Book challenge update

At the beginning of this year, I took a break from Twitter (which lasted three months), and set myself a target of reading at least 50 books during the calendar year. As things stand, the count is at 28, and it’s unlikely that I’ll hit my target, unless I count Berry’s story books in the list.

While I’m not particularly worried about my target, what I am worried about is that the target has made me see books differently. For example, I’m now less liable to abandon books midway – the sunk cost fallacy means that I try harder to finish so that I can add to my annual count. Sometimes I literally flip through the pages of the book looking for interesting things, in an attempt to finish it one way or the other (I did this for Ray Dalio’s Principles and Randall Munroe’s What If, both of which I rated lowly).

Then, the target being in terms of number of books per year means that I get annoyed with long books. Like it’s been nearly a month since I started Jonathan Wilson’s Angels with Dirty Faces, but I’m still barely 30% of the way through – a figure I know because I’m reading it on my Kindle.

Even worse are large books that I struggle to finish. I spent about a month on Bill Bryson’s At Home, but it’s too verbose and badly written, so I gave it up halfway through. I don’t know if I should count it in my reading challenge. It’s a similar story with Siddhartha Mukherjee’s The Emperor of All Maladies – this morning, I put it down for maybe the fourth time (I bought it when it was first published) after failing to make progress – it’s simply too dry for someone not passionate about the subject.

Oh, and this has been the big insight from this reading challenge – that I read significantly faster on the Kindle than I do on physical books. Firstly, it’s easier to carry around. Secondly, I can read in the dark, since I got myself a Kindle Paperwhite last year. One of the times I read from my Kindle is in the evening when I’m putting Berry to sleep, which means I need to read in the dark with a device that doesn’t produce much light. Then, the ability to control font size and the ease of turning pages mean that I progress so much faster – even when I stop to highlight and make notes (a feature I miss dearly when reading physical books; searchable notes are a game changer).

I also find that when I’m reading on the Kindle, it’s easier to “put fight” to get through a book that is difficult to read but insightful. That’s how I managed to get through Diana Eck’s India: A Sacred Geography, and that’s the reason I made it a point to buy Jordan Peterson’s book on the Kindle – I knew it would be a tough read and I would never be able to get through it if I were to read the physical version.

Finally, the time taken to finish a book follows a bimodal distribution. I either finish the book in a day or two, or I take a month over it. For example, I went to Copenhagen for a holiday in August, and found a copy of Michael Lewis’s The Big Short in my AirBnB. I was there for three days, but finished it off in that time. On the other hand, 12 Rules for Life took over a month.