## Telling stories with data

I’m about 20% through with The Verdict by Prannoy Roy and Dorab Sopariwala. It’s a fascinating book, except for one annoyance – it is full of tables that serve no purpose but to break the flow of text.

I must mention that I’m reading the book on the Kindle, which means that the tables can pose a major annoyance. Text breaks off midway through one page, and the next couple of pages involve a table or two, with several lines of text explaining what’s in the table. And then the text continues. It makes for a rather disruptive reading experience. And some of the tables have just one data point – making one wonder why it has been inserted there at all.

This is not the first book that I’ve noticed that makes this mistake. Some of the sports analytics books I’ve read in recent times, such as The Numbers Game, make the same error (I read that in print, and still had the same disruption). Bhagwati and Panagariya’s Why Growth Matters is similarly unreadable. In each case, tables are abruptly inserted into the middle of the text, and the reader loses the flow of reading.

Telling a data story in book length is a completely different challenge to telling one in article length. And telling a story with data is a complete art form. When you’re putting a table there, you need to be able to explain why that table is important to the story – rather than putting it there just because it seems more rigorous.

Also the exact placement of the table (something that can’t be controlled well in Kindle, but is easy to fix in either HTML or print) matters – the table should be relevant to the piece of text immediately preceding and succeeding it, in a way that it doesn’t disrupt the reader’s flow. More importantly, the table should be able to add value at that particular point – perhaps building on something that has been described in the previous paragraph.

Book length makes it harder because people don’t normally expect tables and figures to disturb their reading flow when reading something of book length. Also, the book format means that it is not always possible to insert a table at a precise point (even in print, where pagination is an issue).

So how do you tell a book length story with data? Firstly, be very stingy about the data that you want to show – anything that doesn’t immediately add value should be banished to the appendix. Even the rigour, which academics might be particular about, can be pushed to the end notes (not footnotes, since those can be disruptive to flow as well, turning pages into half pages).

Then, once you know that showing a particular table or graph is essential to telling the story, put it either at the beginning or the end of a chapter. This way, it doesn’t break the reader’s flow. Then, refer to individual numbers in the middle of the text without having to put the entire table in there. Unless each and every data point in the table is important, banish it to the endnotes.

One other common mistake (I did it in my piece in Forbes published yesterday) is to put a big table and not talk about it. It only serves to confuse the reader, who starts looking for explanations for everything in the table in later parts.

I guess authors and analysts tend to get possessive. If you have worked hard to produce insights from data, you seek to share as much of it as possible. And this can mean simply dumping all the data in the piece without regard for what the reader will do with it.

I’m making a note to myself to not repeat this mistake in future.

## The problem with spider charts

On FiveThirtyEight, Nate Silver has a piece looking ahead to the Democratic primaries ahead of the presidential elections in the US next year. I don’t know enough about US politics to comment on the piece itself, but what caught my eye is the spider chart describing the various Democratic nominees.

This is a standard spider chart that people who read business news should recognise, so the appearance of such a chart isn’t big news. What bothers me, though, is that a respected data journalist like Nate Silver is publishing such charts, especially in an article under his own name. For spider charts do a lousy job of conveying information.

Implicitly, you might think that the area of the pentagon (in this case) thus formed conveys the strength of a particular candidate. Leaving aside the fact that the human eye can judge areas less well than lengths, the area of a spider chart accurately shows “strength” only in one corner case – where the values along all five axes are the same.

In all other cases, such as in the spider charts above, the area of the pentagon (or whatever-gon) thus formed depends on the order in which the factors are placed. For example, in this chart, why should black voters be placed between the Asian/Hispanic voters and millennials? Why should party loyalists lie between the Asian/Hispanic voters and the left?

I may not have that much insight into US politics, but it should be fairly clear that the ordering of the factors in this case has no particular sanctity. You should be able to jumble up the order of the axes and the information in the chart should remain the same.

The spider chart doesn’t work this way. If the lengths of the “semidiagonals” (the five axes on which we are measuring) are $l_1, l_2, \ldots, l_n$, the area of the polygon thus formed equals $\frac{1}{2} \sin\left(\frac{360^\circ}{n}\right) (l_1 l_2 + l_2 l_3 + \cdots + l_n l_1)$. It is not hard to see that for any value of $n \ge 4$, the ordering of the “axes” makes a material difference in the area of the chart.
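To make the dependence on ordering concrete, here is a minimal Python sketch of the area formula above. The function name `spider_area` and the five scores are made up for illustration; they are not taken from the FiveThirtyEight chart.

```python
import math
from itertools import permutations


def spider_area(lengths):
    """Area of the polygon on a spider chart with equally spaced axes."""
    n = len(lengths)
    angle = 2 * math.pi / n  # 360 degrees / n, in radians
    # Products of neighbouring axis values, wrapping around from last to first
    neighbour_products = sum(
        lengths[i] * lengths[(i + 1) % n] for i in range(n)
    )
    return 0.5 * math.sin(angle) * neighbour_products


# Hypothetical scores for one candidate on five axes
scores = [9, 2, 8, 3, 7]

# Same five numbers, every possible ordering of the axes
areas = [spider_area(order) for order in permutations(scores)]
print("smallest area:", round(min(areas), 2))
print("largest area:", round(max(areas), 2))
```

The same five numbers produce visibly different areas depending only on which axes happen to sit next to each other, which is exactly the objection to reading "strength" off the area.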

Moreover, in this particular case, with the legend being shown only with one politician, you need to keep looking back and forth to analyse where a particular candidate lies in terms of support among the five big Democratic bases. Also, the representation suggests that these five bases have equal strength in the Democratic support base, while the reality may be far from it (again I don’t have domain knowledge).

Spider charts can look pretty, which might make them attractive for graphic designers. They are just not so good in conveying information.

PS: for this particular data set, I would just go with bars with small multiples (call me boring if you may). One set of bar graphs for each candidate, with consistent colour coding and ordering among the bars so that candidates can be compared easily.

## Football Elo Application

This morning, I discovered the Club Elo Ratings, and promptly proceeded to analyse Liverpool FC’s performance over the years based on these ratings, and then correlated the performance by manager.

Then, playing around with the data of different clubs, I realised that there are plenty more stories to be told using this data, and they are best told by people who are passionate about their respective clubs. So the best thing I could do is to put the data out there (in a form similar to what I did for Liverpool), so that people can analyse how their clubs have performed over the years, and under different managers.

Sitting beside me as I was doing this analysis, my wife popped in with a pertinent observation. Now, she doesn’t watch football. She hates it that I watch so much football. Nevertheless, she has a strong eye for metrics. And watching me analyse club performance by manager, she asked me if I can analyse manager performance by club!

And so I’ve added that as well to the Shiny app that I’ve built. It might look a bit clunky, with two seemingly unrelated graphs, one on top of the other, but since the two are strongly related, it makes sense to have both in the same app. The managers listed in the bottom dropdown are those who have managed at least two clubs in the Premier League.

If you’re interested in Premier League football, you should definitely check out the app. I think there are some interesting insights to be gleaned (such as what I presented in this morning’s post).

## Built by Shanks

This morning, I found this tweet by John Burn-Murdoch, a statistician at the Financial Times, about a graphic he had made for a Simon Kuper (of Soccernomics fame) piece on Jose Mourinho.

Burn-Murdoch also helpfully shared the code he had written to produce this graphic, through which I discovered ClubElo, a website that produces chess-style Elo ratings for football clubs. They have a free and open API, through which Burn-Murdoch got the data for the above graphic, and which I used to download all-time Elo ratings for all clubs available (I can be greedy that way).
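For readers unfamiliar with Elo, here is a minimal Python sketch of the standard chess-style update. It is only the textbook formula: whatever football-specific adjustments ClubElo applies are not reproduced here, and the K-factor of 20 and the function names are assumptions for illustration.

```python
def expected_score(rating_a, rating_b):
    """Expected score of side A against side B under the standard Elo formula."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_ratings(rating_a, rating_b, score_a, k=20.0):
    """Return new ratings after one match.

    score_a is 1.0 for an A win, 0.5 for a draw, 0.0 for an A loss.
    k (assumed value) controls how fast ratings react to results.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating is conserved
    return rating_a + delta, rating_b - delta


# Hypothetical ratings: a 1900-rated club beats a 1700-rated one
print(update_ratings(1900, 1700, 1.0))
```

Beating a weaker side moves the rating only a little, while losing to one costs a lot, which is why a sustained climb in a rating graph says something about a managerial regime.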

So the first order of business was to see how Liverpool’s rating has moved over time. The initial graph looked interesting, but not very interesting, so I decided to overlay it with periods of managerial regimes (the latter data I got through wikipedia). And this is what the all-time Elo rating of Liverpool looks like.

It is easy to see that the biggest improvement in the club’s performance came under the long reign of Bill Shankly (no surprises there), who took them from the Second Division to winning the old First Division. There was a brief dip when Shankly retired and his assistant Bob Paisley took over (might this be the time when Paisley got intimidated by Shankly’s frequent visits to the club, and then asked him not to come any more?), but Paisley consolidated on Shankly’s improvement to lead the club to its first three European Cups.

Around 2010, when the club was owned by Americans Tom Hicks and George Gillett and on a decline in terms of performance, this banner became popular at Anfield.

The Yanks were subsequently yanked following a protracted court battle, to be replaced by another Yank (John W Henry), under whose ownership the club has done much better. What is also interesting from the above graph is the managerial change decisions.

At the time, Kenny Dalglish’s sacking at the end of the 2011-12 season (which ended with Liverpool losing the FA Cup final to Chelsea) seemed unfair, but the Elo rating shows that the club’s rating had fallen below the level when Dalglish took over (initially as caretaker). Then there was a steep ascent under Brendan Rodgers (leading to second in 2013-14), after which Suarez bit and got sold, and the team went into deep decline.

Again, we can see that Rodgers got sacked when the team had reverted to the rating that he had started off with. That’s when Jurgen Klopp came in, and thankfully so far there has been a much longer period of ascendance (which will hopefully continue). It is interesting to see, though, that the club’s current rating is still nowhere near the peak reached under Rafa Benitez (in the 2008-9 title challenge).

Impressed by the story that Elo Ratings could tell, I got data on all Premier League managers, and decided to repeat the analysis for all clubs. Here is what the analysis for the so-called “top 6” clubs returns:

We see, for example, that Chelsea’s ascendancy started not with Mourinho’s first term as manager, but towards the end of Ranieri’s term – when Roman Abramovich had made his investment. At Manchester United, we find that Jose Mourinho actually made up for the decline under David Moyes and Louis van Gaal, and then started losing it. In that sense, Manchester United have got their sacking timing right (though they were already in decline by the time they finished last season in second place).

Manchester City also seem to have done pretty well in terms of the timing of managerial changes. And Spurs’s belief in Mauricio Pochettino, who started off badly, seems to have paid off.

I wonder why Elo Ratings haven’t made more impact in sports other than chess!

## Just Plot It

One of my favourite work stories is from this job I did a long time ago. The task given to me was demand forecasting, and the variable I needed to forecast was so “micro” (this intersection that intersection the other) that forecasting was an absolute nightmare.

A side effect of this has been that I find it impossible to believe that it’s possible to forecast anything at all. Several (reasonably successful) forecasting assignments later, I still dread it when the client tells me that the project in question involves forecasting.

Another side effect is that the utter failure of standard textbook methods in that monster forecasting exercise all those years ago means that I find it impossible to believe that textbook methods work with “real life data”. Textbooks and college assignments are filled with problems that when “twisted” in a particular way easily unravel, like a well-tied tie knot. Industry data and problems are never as clean, and elegance doesn’t always work.

Anyway, coming back to the problem at hand, I had struggled for several months with this monster forecasting problem. Most of this time, I had been using one programming language that everyone else in the company used. The code was simultaneously being applied to lots of different sub-problems, so through the months of struggle I had never bothered to really “look at” the data.

I must have told this story before, when I spoke about why “data scientists” should learn MS Excel. For what I did next was to load the data onto a spreadsheet and start looking at it. And “looking at it” involved graphing it. And the solution, or the lack of it, lay right before my eyes. The data was so damn random that it was a wonder that anything had been forecast at all.

It was also a wonder that the people who had built the larger model (into which my forecasting piece was to plug in) had assumed that this data would be forecast-able at all (I mentioned this to the people who had built the model, and we’ll leave that story for another occasion).

In any case, looking at the data, by putting it in a visualisation, completely changed my perspective on how the problem needed to be tackled. And this has been a learning I haven’t let go of since – the first thing I do when presented with data is to graph it out, and visually inspect it. Any statistics (and any forecasting for sure) comes after that.

Yet, I find that a lot of people simply fail to appreciate the benefits of graphing. That it is not intuitive to do with most programming languages doesn’t help. Incredibly, even Python, a favoured tool of a lot of “data scientists”, doesn’t make graphing easy. Last year when I was forced to use it, I found that it was virtually impossible to create a PDF with lots of graphs – something that I do as a matter of routine when working on R (I subsequently figured out a (rather inelegant) hack the next time I was forced to use Python).

Maybe when you work on data that doesn’t have meaningful variables – such as images, for example – graphing doesn’t help (since a variable on its own has little information). But when the data remotely has some meaning – sales or production or clicks or words, graphing can be of immense help, and can give you massive insight on how to develop your model!

So go ahead, and plot it. And I won’t mind if you fail to thank me later!

## Bangalore names are getting shorter

The Bangalore Names Dataset, derived from the Bangalore Voter Rolls (cleaned version here), validates a hypothesis that a lot of people had – that given names in Bangalore are becoming shorter. From an average of 9 letters in the name for a male aged around 80, the length of the name comes down to 6.5 letters for a 20 year old male.

What is interesting from the graph (click through for a larger version) is the difference in lengths of male and female names – notice the crossover around the age 25 or so. At some point in time, men’s names continue to become shorter while women’s names’ lengths stagnate.

So how are names becoming shorter? For one, honorific endings such as -appa, -amma, -anna, -aiah and -akka are becoming increasingly less common. Someone named “Krishnappa” (the most common name with the ‘appa’ suffix) in Bangalore is on average 56 years old, while someone named Krishna (the same name without the suffix) is on average only 44 years old. Similarly, the average age of people named Lakshmamma is 55, while that of everyone named Lakshmi (the same name without the suffix) is just 40.

In fact, if we look at the top 12 male and female names with an honorific ending, the average age of the version without the ending is lower than that of the version with the ending. I’ve even graphed some of the distributions to illustrate this.

In one of the posts yesterday, we looked at the most typical names by age in Bangalore. What happens when we flip the question? Can we define what are the “oldest” and “youngest” names? Can we define these based on the average age of people who hold that name? In order to rule out fads, let’s stick to names that are held by at least 10000 people each.
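The definition above can be sketched in a few lines of Python. This is only an illustration: the `(name, age)` pair layout, the function name `average_age_by_name`, and the toy rows are assumptions, not the actual structure or contents of the voter dataset.

```python
from collections import defaultdict


def average_age_by_name(voters, min_count):
    """Average age per given name, keeping only names held by at least
    min_count people, sorted from "youngest" name to "oldest"."""
    totals = defaultdict(lambda: [0, 0])  # name -> [sum of ages, head count]
    for name, age in voters:
        totals[name][0] += age
        totals[name][1] += 1
    averages = {
        name: age_sum / count
        for name, (age_sum, count) in totals.items()
        if count >= min_count  # the popularity cutoff that rules out fads
    }
    return sorted(averages.items(), key=lambda item: item[1])


# Made-up rows purely for illustration
toy_voters = [
    ("Name A", 22), ("Name A", 30),
    ("Name B", 58), ("Name B", 62),
    ("Name C", 19),  # held by only one person, so dropped by the cutoff
]
print(average_age_by_name(toy_voters, min_count=2))
```

On the real data the cutoff would be 10000 (or 1000 for the relaxed version); the first and last entries of the sorted list are then the "youngest" and "oldest" names.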

These graphs are candidates for my own Bad Visualisations Tumblr, but I couldn’t think of a better way to represent the data. These graphs show the most popular male and female names, with the average age of a voter with that given name on the X axis, and the number of voters with that name on the Y axis. The information is all in the X axis – the Y axis is there just so that names don’t overlap.

So Karthik is among the youngest names among men, with the average age among voters being about 28 (remember this is not the average age of all Karthiks in Bangalore – those aged below 18 or otherwise not eligible to vote have been excluded). On the women’s side, Divya, Pavithra and Ramya are among the “youngest names”.

At the other end, you can see all the -appas and -ammas. The “oldest male name” is Krishnappa, with an average age of 56. And then you have Krishnamurthy and Narayana, which don’t have the -appa suffix but represent an old population anyway (the other -appa names just don’t clear the 10000 people cutoff).

More women’s names with the -amma suffix clear the 10000 names cutoff, and we can see that pretty much all women’s names with an average age of 50 and above have that suffix. And the “oldest female name”, subject to 10000 people having that name, is Muniyamma. And then you have Sarojamma and Jayamma and Lakshmamma. And a lot of other ammas.

What will be the oldest and youngest names if we relax the popularity cutoff, and instead look at names with at least 1000 people? The five youngest names are Dhanush, Prajwal, Harshitha, Tejas and Rakshitha, all with an average age (among voters) less than 24. The five oldest names are Papamma, Kannamma, Munivenkatappa, Seethamma and Ramaiah.

This should give another indication of where names are headed in Bangalore!

## Smashing the Law of Conservation of H

A decade and a half ago, Ravikiran Rao came up with what he called the “law of conservation of H”. The concept has to do with the South Indian practice of adding a “H” to denote a soft consonant, a practice not shared by North Indians (Karthik instead of Kartik for example). This practice, Ravikiran claims, is balanced by the “South Indian” practice of using “S” instead of “Sh”, because of which the number of Hs in a name is conserved.

Ravikiran writes:

The Law of conservation of H states that the total number of H’s in the universe will be conserved. So the extra H’s that are added when Southies have to write names like Sunitha and Savitha are taken from the words Sasi and Sri Sri Ravisankar, thus maintaining a balance in the language.

Using data from the Bangalore first names data set (warning: very large file), it is clear that this theory doesn’t hold water, in Bangalore at least. For what the data shows is that not only do Bangaloreans love the “th” and “dh” for the soft T and D, they also use “sh” to mean “sh” rather than use “s” instead.

The most commonly cited examples of LoCoH are Swetha/Shweta and Sruthi/Shruti. In both cases, the former is the supposed “South Indian” spelling (with th for the soft T, and S instead of sh), while the latter is the “North Indian” spelling. As it turns out, in Bangalore, both these combinations are rather unpopular. Instead, it seems like if Bangaloreans can add a H to their name, they do. This table shows the number of people in Bangalore with different spellings for Shwetha and Shruthi (now I’m using the dominant Bangalorean spellings).

As you can see, Shwetha and Shruthi are miles ahead of any of the alternate ways in which the names can be spelt. And this heavy usage of H can be attributed to the way Kannada incorporates both Sanskrit and Dravidian history.

Kannada has a pretty large vocabulary of consonants. Every consonant has both the aspirated and unaspirated version, and voiced and unvoiced. There are three different S sounds (compared to Tamil which has none) and two Ls. And we need a way to transliterate each of them when writing in English. And while capitalising letters in the middle of a word (as per Harvard Kyoto convention) is not common practice, standard transliteration tries to differentiate as much as possible.

And so, since aspirated Tha and Dha aren’t that common in Kannada (except in the “Tha-Tha” symbols used by non-Kannadigas to show raised eyes), th and dh are used for the dental letters. And since Sh exists (and in two forms), there is no reason to substitute it with S (unlike Tamil). And so we have H everywhere.

Now, lest you were to think that I’m using just two names (Shwetha and Shruthi) to make my point, I dug through the names dataset to see how often names with interchangeable T and Th, and names with interchangeable S and Sh, appear in the Bangalore dataset. Here is a sample of both:

There are 13002 Karthiks registered to vote in Bangalore, but only 213 Kartiks. There are a hundred times as many Lathas as Latas. Shobha is far more common than Sobha, and Chandrashekhar much more common than Chandrasekhar.

So while other South Indians might conserve H, by not using them with S to compensate for using it with T and D, it doesn’t apply to Bangalore. Thinking about it, I wonder how a Kannadiga (Ravikiran) came up with this theory. Perhaps the fact that he has never lived in Karnataka explains it.

## The Comeback of Lakshmi

A few months back I stumbled upon this dataset of all voters registered in Bangalore. A quick scraping script followed by a run later, I had the names and addresses and voter IDs of all voters registered to vote in Bangalore in the state assembly elections held this year.

As you can imagine, this is a fantastic dataset on which we can do the proverbial “gymnastics”. To start with, I’m using it to analyse names in the city, something like what Hariba did with Delhi names. I’ll start by looking at the most common names, and by age.

Now, extracting first names from a dataset of mostly South Indian names is tricky, since South Indians are quite likely to use initials, and place them before their given names (for example, when in India, I most commonly write my name as “S Karthik”). I decided to treat all words of length 1 or 2 as initials (thus missing out on the “Om”s), and assume that the first word in the name of length 3 or greater is the given name (again ignoring those who put their family names first, or those that have expanded initials in the voter set).
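The heuristic described above is simple enough to sketch in Python. The function name `given_name`, the handling of full stops, and the sample rows are assumptions for illustration; the original script is not shown in this post.

```python
from collections import Counter


def given_name(full_name):
    """First word of length 3 or more; shorter words are treated as initials."""
    # Treating "." as a separator is an added assumption, so "S. Karthik" also works
    for word in full_name.replace(".", " ").split():
        if len(word) >= 3:
            return word
    return None  # names like "Om" are lost, as acknowledged above


# Hypothetical sample rows, not taken from the voter rolls
sample = ["S Karthik", "K R Lakshmi", "Manjunatha B", "Om", "Lakshmi Devi"]

# Count given names, skipping rows where nothing qualified
counts = Counter(name for name in map(given_name, sample) if name)
print(counts.most_common(3))
```

Note that spellings are counted exactly as they appear, so variants of the same name end up as separate entries in the counts.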

The most common male first name in Bangalore, not surprisingly, is Mohammed, borne by 1.5% of all male registered voters in the city. This is followed by Syed, Venkatesh, Ramesh and Suresh. You might be surprised that Manjunath doesn’t make the list. This is a quirk of the way I’ve analysed the data – I’ve taken spellings as given and not tried to group names by alternate spellings.

And as it happens, Manjunatha is in sixth place, while Manjunath is in 8th, and if we were to consider the two as the same name, they would comfortably outnumber the Mohammeds! So the “Uber driver Manjunath(a)” stereotype is fairly well-founded.

Coming to the women, the most common name is Lakshmi, with about 1.55% of all women registered to vote having that name. Lakshmi is closely followed by Manjula (1.5%), with Geetha, Lakshmamma and Jayamma coming some way behind (all less than 1%) but taking the next three spots.

Where it gets interesting is if we were to look at the most common first name by age – see these tables.

Among men, it’s interesting to note that among the younger age group (18-39, with exception of 35) and older age group (57+), Muslim names are the most common, while the intermediate range of 40-56 sees Hindu names such as Venkatesh and Ramesh dominating (if we assume Manjunath and Manjunatha are the same, the combined name comes top in the entire 26-42 age group).

I find the pattern of most common women’s names more interesting. It is interesting to note that the -amma suffix seems to have been done away with over the years (suffixes will be analysed in a separate post), with Lakshmamma turning into Lakshmi, for example.

It is also interesting to note that for a long period of time (women currently aged 30-43), Lakshmi went out of fashion, with Manjula taking over as the most common name! And then the trend reversed, as we see that the most common name among 24-29 year old women is Lakshmi again! And that seems to have gone out of fashion once again, with “modern names” such as Divya, Kavya and Pooja taking over! Check out these graphs to see the trends.

(I’ve assumed Manjunath and Manjunatha are the same for this graph)

So what explains Manjunath and Manjula being so incredibly popular in a certain age range, but quickly falling away on both sides? Maybe there was a lot of fog (manju) over Bangalore for a few years? 😛

## Human, Animal and Machine Intelligence

Earlier this week I started watching this series on Netflix called “Terrorism Close Calls”. Each episode is about an instance of attempted terrorism that has been foiled in the last 2 decades. For example, one episode covers the plot to bomb a set of transatlantic flights from London to North America in 2006 (a consequence of which is that liquids still aren’t allowed on board flights).

So the first episode of the series involves this Afghan guy who drives all the way from Colorado to New York to place a series of bombs in the latter’s subways (metro train system). He is under surveillance through the length of his journey, and just as he is about to enter New York, he is stopped for what seems like a “routine drugs test”.

As the episode explains, “a set of dogs went around his car sniffing”, but “rather than being trained to sniff drugs” (as is routine in such a stop), “these dogs had been trained to sniff explosives”.

This little snippet got me thinking about how machines are “trained” to “learn”. At the most basic level, machine learning involves showing a large number of “positive cases” and “negative cases” based on which the program “learns” the differences between the positive and negative cases, and thus to identify the positive cases.

So if you want to build a system to identify cats in an image, you feed the machine a large number of images with cats in them, and a large(r) number of images without cats in them, each appropriately “labelled” (“cat” or “no cat”) and based on the differences, the system learns to identify cats.

Similarly, if you want to teach a system to detect cancers based on MRIs, you show it a set of MRIs that show malignant tumours, and another set of MRIs without malignant tumours, and sure enough the machine learns to distinguish between the two sets (you might have come across claims of “AI can cure cancer”. This is how it does it).

However, AI can sometimes go wrong by learning the wrong things. For example, an algorithm trained to recognise sheep started classifying grass as “sheep” (since most of the positive training samples had sheep in meadows). Another system went crazy in its labelling when an unexpected object (an elephant in a drawing room) was present in the picture.

While machines learn through lots of positive and negative examples, that is not how humans learn, as I’ve been observing as my daughter grows up. When she was very little, we got her a book with one photo each of 100 different animals. And we would sit with her every day pointing at each picture and telling her what each was.

Soon enough, she could recognise cats and dogs and elephants and tigers. All by means of being “trained on” one image of each such animal. Soon enough, she could recognise hitherto unseen pictures of cats and dogs (and elephants and tigers). And then recognise dogs (as dogs) as they passed her on the street. What absolutely astounded me was that she managed to correctly recognise a cartoon cat, when all she had seen thus far were “real cats”.

So where do animals stand, in this spectrum of human to machine learning? Do they recognise from positive examples only (like humans do)? Or do they learn from a combination of positive and negative examples (like machines)? One thing that limits the positive-only learning for animals is the limited range of their communication.

What drives my curiosity is that they get trained for specific things – that you have dogs to identify drugs and dogs to identify explosives. You don’t usually have dogs that can recognise both (specialisation is for insects, as they say – or maybe it’s for all non-human animals).

My suspicion (having never had a pet) is that the way animals learn is closer to how humans learn – based on a large number of positive examples, rather than as the difference between positive and negative examples. It is just that the animal’s limited range of communication makes it hard to train them for more than one thing (or maybe there’s something to do with their mental bandwidth as well. I don’t know).

What do you think? Interestingly enough, there is a recent paper that talks about how many machine learning systems have “animal-like abilities” rather than coming close to human intelligence.

For millions of years, mankind lived, just like the animals.
And then something happened that unleashed the power of our imagination. We learned to talk
– Stephen Hawking, in the opening of a Roger Waters-less Pink Floyd’s Keep Talking

## Single Malt Recommendation App

Life is too short to drink whisky you don’t like.

How often have you found yourself in a duty free shop in an airport, wondering which whisky to take back home? Unless you are a pro at this already, you might want something you haven’t tried before, but don’t want to end up buying something you may not like. The names are all grand, as Scottish names usually are. The region might offer some clue, but not so much.

So I started on this work a few years back, when I first discovered this whisky database. I had come up with a set of tables to recommend what whisky is similar to what, and which single malts are the “most unique”. Based on this, I discovered that I might like Ardbeg. And I ended up absolutely loving it.

And ever since, I’ve carried a couple of tables in my Evernote to make sure I have some recommendations handy when I’m at a whisky shop and need to make a decision. But then the tables are not user friendly, and don’t typically tell you what you should buy, and what your next choice should be and so on.

To make things more user-friendly, I have built this app where all you need to enter is your favourite set of single malts, and it gives you a list of other single malts that you might like.

The data set is the same. I once again use cosine similarity to find the similarity of different whiskies. Except that this time I take the average of your favourite whiskies, and then look for the whiskies that are closest to that.
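Here is a minimal sketch of that logic, written in Python purely for illustration (the app itself is built in R with Shiny). The function names, the three flavour dimensions and the malt names are all hypothetical; they are not taken from the whisky database.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recommend(profiles, favourites, top_n=3):
    """Average the favourites' flavour vectors, then rank the rest by similarity."""
    fav_vectors = [profiles[name] for name in favourites]
    # Column-wise mean of the favourite whiskies' scores
    centroid = [sum(col) / len(fav_vectors) for col in zip(*fav_vectors)]
    ranked = sorted(
        ((name, cosine(centroid, vec))
         for name, vec in profiles.items()
         if name not in favourites),  # never recommend what you already like
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [name for name, _ in ranked[:top_n]]


# Made-up flavour scores in the order (smoky, sweet, fruity)
toy_profiles = {
    "Malt A": [4, 0, 0],
    "Malt B": [3, 1, 0],
    "Malt C": [0, 4, 1],
    "Malt D": [0, 3, 2],
}
print(recommend(toy_profiles, ["Malt A"], top_n=2))
```

Averaging first and comparing once keeps the ranking cheap, at the cost of blurring favourites that are very different from each other.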

In terms of technologies, I’ve used this R package called Shiny to build the app. It took not more than half an hour of programming effort to build, and most of that was in actually building the logic, not the UI stuff.

So take it for a spin, and let me know what you think.