Periodicals and Dashboards

The purpose of a dashboard is to give you a live view of what is happening with a system. Take, for example, the instrument it is named after – the car dashboard. It tells you, at this moment, how fast the car is going, along with other indicators such as which lights are on, the engine temperature, the fuel level, and so on.

Not all reports, however, need to be dashboards. Some reports can be periodicals. These periodicals don’t tell you what’s happening at a moment, but give you a view of what happened in or at the end of a certain period. Think, for example, of classic periodicals such as newspapers or magazines, in contrast to online newspapers or magazines.

Periodicals tell you the state of a system at a certain point in time, and also tell you what happened to the system in the period leading up to it. A financial daily, for example, tells you where the stock market closed the previous day, and how the market moved over the preceding day, month, year, and so on.

Doing away with metaphors, business reporting can be classified into periodicals and dashboards. And they work exactly like their metaphorical counterparts. Periodical reports are produced periodically and tell you what happened in a certain period, or at a certain point of time, in the past. A good example is company financials – the income statement describes what happened over a period, and the balance sheet describes the state of the company at a point in time.

Once a periodical is produced, it is frozen in time for posterity. Another edition will be produced at the end of the next period, but it is a new edition. It adds to the earlier periodical rather than replacing it. Periodicals thus have historical value, and because they are preserved, they need to be designed more carefully.

Dashboards, on the other hand, are fleeting, and not usually preserved for posterity. They are simply overwritten. So whether all systems are up this minute stops mattering a minute later if you haven’t reacted to the report this minute (of course, some aspects might still matter at a later date, and those will be captured in the next periodical).

When we are designing business reports and other “business intelligence systems”, we need to be cognisant of whether we are producing a dashboard or a periodical. The fashion nowadays is to produce everything as a dashboard, perhaps because popular dashboarding tools are readily available.

However, dashboards are expensive. For one, they need a constant connection to the underlying “system” (database or data warehouse or data lake or whatever else stores the data behind the report). Also, by definition they are not stored, and if you do need to store them you have to decide on a frequency of storage – which makes them periodicals anyway.

So companies can save significantly on resources (compute and storage) by switching from dashboards (which everyone seems to think in terms of) to periodicals. The key here is to get the frequency of the periodical right – too frequent, and people will get bugged; not frequent enough, and they will get bugged again by the lack of information. Given the tools and technologies at hand, we can even make reports “on demand” (for stuff not used by too many people).

Surveying Income

For a long time now, I’ve been sceptical of the practice of finding out the average income in a country or state or city or locality by doing a random survey. The argument I’ve made is “whether you keep Mukesh Ambani in the sample or not makes a huge difference in your estimate”. Until now, though, I hadn’t been able to make a proper mathematical argument.

In the course of writing a piece for Bloomberg Quint (my first for that publication), I figured out a precise mathematical argument. Basically, incomes are distributed according to a power law, and the exponent of that power law is such that the variance is not defined. And hence the Central Limit Theorem isn’t applicable.

OK, let me explain that in English. The reason sample surveys work is due to a result known as the Central Limit Theorem. It states that for a distribution with finite mean and variance, the average of a random sample of data points is not very far from the average of the population, and the difference follows a normal distribution with zero mean and a variance inversely proportional to the number of points surveyed.

So if you want to find out the average height of the population of adults in an area, you can simply take a random sample, find out their heights and you can estimate the distribution of the average height of people in that area. It is similar with voting intention – as long as the sample of people you survey is random (and without bias), the average of their voting intention can tell you with high confidence the voting intention of the population.

This, however, doesn’t work for income. Based on data from the Indian Income Tax department, I could confirm (what theory states) that income in India follows a power law distribution. As I wrote in my piece:

The basic feature of a power law distribution is that it is self-similar – where a part of the distribution looks like the entire distribution.

Based on the income tax returns data, the number of taxpayers earning more than Rs 50 lakh is 40 times the number of taxpayers earning over Rs 5 crore.
The ratio of the number of people earning more than Rs 1 crore to the number of people earning over Rs 10 crore is 38.
About 36 times as many people earn more than Rs 5 crore as do people earning more than Rs 50 crore.

In other words, if you increase the income limit by a factor of 10, the number of people who earn over that limit falls by a factor of between 35 and 40. This translates to a power law exponent of roughly 1.54 to 1.6 (log 35 and log 40 to base 10, respectively).
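To make the arithmetic explicit, write the tail of the distribution as $P(X > x) \propto x^{-\alpha}$. Multiplying the threshold by 10 then divides the number of people above it by a constant factor, which pins down the exponent:

\[
\frac{P(X > x)}{P(X > 10x)} = \frac{x^{-\alpha}}{(10x)^{-\alpha}} = 10^{\alpha} \approx 35 \text{ to } 40
\quad\Rightarrow\quad
\alpha \approx \log_{10} 35 \approx 1.54 \;\text{ to }\; \log_{10} 40 \approx 1.60
\]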

Now power laws have a quirk – their mean and variance are not always defined. If the exponent of the power law is less than 1, the mean is not defined. If the exponent is less than 2, then the distribution doesn’t have a defined variance. So in this case, with an exponent around 1.6, the distribution of income in India has a well-defined mean but no well-defined variance.

To recall, the central limit theorem states that the sample mean follows a normal distribution centred at the population mean, with a variance of $\frac{\sigma^2}{n}$, where $\sigma$ is the standard deviation of the underlying distribution. And when the underlying distribution itself is a power law with an exponent less than 2 (as is the case in India), $\sigma$ itself is not defined.

Which means the distribution of the sample mean around the population mean has infinite variance. Which means the sample mean tells you absolutely nothing about the true average income!
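A quick way to convince yourself of this is simulation. The sketch below (in R, with illustrative parameters rather than actual income data) draws repeated “surveys” from a Pareto-type distribution with tail exponent 1.6, and from a normal distribution for comparison, and looks at how the sample means behave:

```r
# Rough simulation of why sample means are unreliable when the underlying
# distribution is a power law with tail exponent alpha < 2. Parameters are
# illustrative, not calibrated to actual Indian income data.

set.seed(42)

alpha <- 1.6      # tail exponent, in the range estimated from the tax data
x_min <- 1        # scale parameter (arbitrary units)
n     <- 1000     # size of each "survey"
reps  <- 10000    # number of repeated surveys

# Draw from a Pareto distribution via inverse transform sampling
rpareto <- function(n, alpha, x_min) x_min / runif(n)^(1 / alpha)

pareto_means <- replicate(reps, mean(rpareto(n, alpha, x_min)))
normal_means <- replicate(reps, mean(rnorm(n, mean = 3, sd = 1)))

# The normal case concentrates tightly around 3, as the CLT promises;
# the Pareto case has a huge spread, with occasional enormous values.
summary(pareto_means)
summary(normal_means)
```

The Pareto sample means never settle down – a single Ambani-sized observation is enough to throw the estimate wildly off.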

And hence, surveying is not a good way to find the average income of a population.

Vlogging!

The first seed was sown in my head by Harish “the Psycho” J, who told me a few months back that nobody reads blogs any more, and I should start making “analytics videos” to increase my reach and hopefully hit a new kind of audience with my work.

While the idea was great, I wasn’t sure for a long time what videos I could make. After all, I’m not the most technical guy around, and I had no patience for making videos on “how to use regression” and stuff like that. I needed a topic that would be both potentially catchy and something where I could add value. So the idea remained an idea.

For the last four or five years, my most common lunchtime activity has been watching chess videos. I subscribe to the YouTube channels of Daniel King and Agadmator, and most days when I eat lunch alone at home are spent watching their analyses of games. Usually this routine gets disrupted on Fridays when the wife works from home (she positively hates these videos), but one Friday a couple of months back I decided to ignore her anyway and watch the videos (she was in her room working).

She had come out to serve herself another helping of whatever she had made that day, and saw me watching the videos. She suddenly asked me why I couldn’t make such videos as well. She has seen me work over the last seven years to build what I think is a fairly cool cricket visualisation, and said that I should use it to make little videos analysing cricket matches.

And since then my constant “background process” has been to prepare for these videos. Earlier, Stephen Rushe of Cricsheet used to unfailingly upload ball-by-ball data of all cricket matches as soon as they were done. However, two years back he went into “maintenance mode” and stopped updating the data. So I needed a way to get the data as well.

Here, I must acknowledge the contributions of Joe Harris of White Ball Analytics, who not only showed me the APIs to get ball-by-ball data of cricket matches, but also gave very helpful inputs on how to make the visualisation more intuitive and palatable to the normal cricket fan who hasn’t seen such a thing before. Joe has his own win probability model based on ball-by-ball data, which I think is possibly superior to mine in a lot of scenarios (my model does badly in high-scoring run chases), though I’ve continued to use my own model.

So finally the data is ready, I have a much-improved visualisation compared to what I had during the IPL last year, and I’ve created what I think is a nice app using the Shiny package, which you can check out for yourself here. This covers all T20 international games, and you can use the app to see the “story of each game”.

And this is where the vlogging comes in – in order to explain how the model works and how to use it, I’ve created a short video. You can watch it here:

While I still have a long way to go in terms of my delivery, you can see that the video has come out rather well. There are no sync issues, and you can also see my face in one corner. This was possible thanks to my school friend Sunil Kowlgi‘s Outklip app. It’s a pretty easy-to-use Chrome app, and the videos are immediately available on the platform. There is quick YouTube integration as well, for you to upload them.

And this is not a one-time effort – going forward I’ll be making videos of limited-overs games, analysing them using my app, and posting them on my YouTube channel (or maybe I’ll make a new channel for these videos – I’ll keep you updated). I hope to become a regular vlogger!

So in the meantime, watch the above video. And give my app a spin. Soon I’ll be releasing versions covering One Day Internationals and franchise T20s as well.


Premier League Points Efficiency

It would be tautological to say that you win in football by scoring more goals than your opponent. What is interesting is that scoring more goals and letting in fewer also works across the games of a whole season, as data from the English Premier League shows.

We saw an inkling of this last year, when I showed that points in the Premier League were highly correlated with goal difference (an R-squared of 96%, for those who are interested). A little past the midway point of the current season, the correlation holds – 96% again.

In other words, a team’s goal difference (number of goals scored minus goals let in) can explain 96% of the variance in the number of points gained by the team in the season so far. The point of this post is to focus on the rest – the part that goal difference doesn’t explain.

In the above image, the blue line is the line of best fit (or regression line). This line predicts the number of points scored by a team given their goal difference. Teams located above this line have been more efficient, or lucky – they have got more points than their goal difference would suggest. Teams below this line have been less efficient, or unlucky – their goal difference has been distributed badly across games, leading to fewer points than they should have got.
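If you want to replicate this, here is a minimal sketch of the calculation (not my actual code), assuming the league table is in a data frame with one row per team and hypothetical columns team, points and goal_difference:

```r
# Minimal sketch of the points-vs-goal-difference regression. Assumes a
# data frame `league_table` with hypothetical columns: team, points,
# goal_difference. The league table itself has to be sourced separately.

fit <- lm(points ~ goal_difference, data = league_table)

summary(fit)$r.squared   # in the region of 0.96 for the Premier League

# Residuals measure "efficiency" (or luck): points over and above what
# the team's goal difference would predict.
league_table$expected_points    <- predict(fit)
league_table$points_vs_expected <- residuals(fit)

# Teams sorted from luckiest/most efficient to unluckiest/least efficient
league_table[order(-league_table$points_vs_expected),
             c("team", "points", "expected_points", "points_vs_expected")]
```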

Manchester City seem to be extremely unlucky this season, in that they have about five fewer points than their goal difference suggests. The other teams close to the top of the league are all above the line – showing they’ve been more efficient in the way their goals have been distributed (Spurs and Arsenal have been luckier than ManYoo, Chelski and Liverpool).

At the other end of the table, Huddersfield Town have been unlucky – their goal difference suggests they should have had four more points, a big difference for a relegation-threatened team. Southampton, Newcastle and Crystal Palace are in the same boat.

Finally, the use of goal difference to break ties in league tables is an attempt to undo the luck (or lack of it) that leads to teams under- or over-performing, in terms of points, given the number of goals they’ve scored and let in. Some teams would have got many more (or fewer) points than they deserved simply by dint of their goals having been distributed better (or worse) across matches (big losses and narrow wins, say). Using goal difference to break ties is a small attempt to set that right.

Built by Shanks

This morning, I found this tweet by John Burn-Murdoch, a statistician at the Financial Times, about a graphic he had made for a Simon Kuper (of Soccernomics fame) piece on Jose Mourinho.

Burn-Murdoch also helpfully shared the code he had written to produce this graphic, through which I discovered ClubElo, a website that produces chess-style Elo ratings for football clubs. They have a free and open API, which Burn-Murdoch used to get the data for the above graphic, and which I used to download all-time Elo ratings for all available clubs (I can be greedy that way).

So the first order of business was to see how Liverpool’s rating has moved over time. The initial graph looked interesting, but not interesting enough, so I decided to overlay it with the periods of the club’s managerial regimes (that data I got from Wikipedia). And this is what the all-time Elo rating of Liverpool looks like.
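If you want to pull the same data, here is a minimal sketch (assuming the API behaves as it did when I accessed it – requesting http://api.clubelo.com/<ClubName> returns a CSV of that club’s Elo history, with an Elo column and From/To dates for each rating spell):

```r
# Minimal sketch of pulling a club's Elo history from the ClubElo API.
# Assumes http://api.clubelo.com/<ClubName> returns a CSV with (at least)
# an Elo column and From/To date columns, as it did when I used it.

get_club_elo <- function(club) {
  url <- paste0("http://api.clubelo.com/", club)
  elo <- read.csv(url, stringsAsFactors = FALSE)
  elo$From <- as.Date(elo$From)
  elo$To   <- as.Date(elo$To)
  elo
}

liverpool <- get_club_elo("Liverpool")

# All-time Elo rating over time; managerial regimes (dates from Wikipedia)
# can then be overlaid on top of this plot.
plot(liverpool$From, liverpool$Elo, type = "l",
     xlab = "Date", ylab = "Elo rating", main = "Liverpool: all-time Elo")
```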

It is easy to see that the biggest improvement in the club’s performance came under the long reign of Bill Shankly (no surprises there), who took them from the Second Division to winning the old First Division. There was a brief dip when Shankly retired and his assistant Bob Paisley took over (might this be the time when Paisley got intimidated by Shankly’s frequent visits to the club, and then asked him not to come any more?), but Paisley built on Shankly’s improvement to lead the club to its first three European Cups.

Around 2010, when the club was owned by Americans Tom Hicks and George Gillett and on a decline in terms of performance, this banner became popular at Anfield.

The Yanks were subsequently yanked following a protracted court battle, to be replaced by another Yank (John W Henry), under whose ownership the club has done much better. What is also interesting in the above graph is the timing of the managerial changes.

At the time, Kenny Dalglish’s sacking at the end of the 2011-12 season (which ended with Liverpool losing the FA Cup final to Chelsea) seemed unfair, but the Elo rating shows that the club’s rating had fallen below the level at which Dalglish had taken over (initially as caretaker). Then there was a steep ascent under Brendan Rodgers (leading to a second-place finish in 2013-14), until Suarez bit, got sold, and the team went into deep decline.

Again, we can see that Rodgers got sacked when the team had reverted to the rating that he had started off with. That’s when Jurgen Klopp came in, and thankfully so far there has been a much longer period of ascendance (which will hopefully continue). It is interesting to see, though, that the club’s current rating is still nowhere near the peak reached under Rafa Benitez (in the 2008-9 title challenge).

Impressed by the story that Elo Ratings could tell, I got data on all Premier League managers, and decided to repeat the analysis for all clubs. Here is what the analysis for the so-called “top 6” clubs returns:

We see, for example, that Chelsea’s ascendancy started not with Mourinho’s first term as manager, but towards the end of Ranieri’s term – when Roman Abramovich had made his investment. We find that Jose Mourinho actually made up for the decline under David Moyes and Louis van Gaal, and then started losing it. In that sense, Manchester United have got their sacking timing right (though they were already in decline by the time they finished last season in second place).

Manchester City also seem to have done pretty well in terms of the timing of managerial changes. And Spurs’s belief in Mauricio Pochettino, who started off badly, seems to have paid off.

I wonder why Elo Ratings haven’t made more impact in sports other than chess!

Bangalore names are getting shorter

The Bangalore Names Dataset, derived from the Bangalore Voter Rolls (cleaned version here), validates a hypothesis a lot of people had – that given names in Bangalore are becoming shorter. From an average of nine letters for a male aged around 80, the length of the given name comes down to about 6.5 letters for a 20-year-old male.

What is interesting from the graph (click through for a larger version) is the difference in the lengths of male and female names – notice the crossover around age 25 or so. At some point, men’s names continue to become shorter while the length of women’s names stagnates.

So how are names becoming shorter? For one, honorific endings such as -appa, -amma, -anna, -aiah and -akka are becoming increasingly less common. Someone named “Krishnappa” (the most common name with the -appa suffix) in Bangalore is on average 56 years old, while someone named Krishna (the same name without the suffix) is on average only 44 years old. Similarly, the average age of people named Lakshmamma is 55, while that of people named Lakshmi (the same name without the suffix) is just 40.
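For the curious, here is roughly how such a comparison can be computed (a sketch, not my actual code), assuming the cleaned dataset is loaded as a data frame with hypothetical columns name (the given name) and age:

```r
# Sketch of the name-vs-age analysis. Assumes the cleaned voter rolls data
# is in a data frame `voters` with hypothetical columns `name` and `age`;
# the actual dataset may use different column names.

library(dplyr)

name_ages <- voters %>%
  group_by(name) %>%
  summarise(mean_age = mean(age), count = n())

# Compare honorific-suffixed names with their bare versions
name_ages %>%
  filter(name %in% c("Krishnappa", "Krishna", "Lakshmamma", "Lakshmi"))

# "Oldest" and "youngest" popular names: restrict to names held by at
# least 10,000 voters to rule out fads, then sort by average age.
name_ages %>%
  filter(count >= 10000) %>%
  arrange(desc(mean_age))
```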

In fact, if we look at the top 12 male and female names with an honorific ending, the average age of the version without the ending is lower than that of the version with the ending. I’ve even graphed some of the distributions to illustrate this.

In each case, the red line shows the age distribution for the longer version of the name, and the blue line the distribution for the shorter version.

In one of the posts yesterday, we looked at the most typical names by age in Bangalore. What happens when we flip the question? Can we identify the “oldest” and “youngest” names, based on the average age of the people who hold them? In order to rule out fads, let’s stick to names held by at least 10,000 people each.

These graphs are candidates for my own Bad Visualisations Tumblr, but I couldn’t think of a better way to represent the data. These graphs show the most popular male and female names, with the average age of a voter with that given name on the X axis, and the number of voters with that name on the Y axis. The information is all in the X axis – the Y axis is there just so that names don’t overlap.

So Karthik is among the youngest names among men, with an average age among voters of about 28 (remember this is not the average age of all Karthiks in Bangalore – those aged below 18 or otherwise not eligible to vote have been excluded). On the women’s side, Divya, Pavithra and Ramya are among the “youngest names”.

At the other end, you can see all the -appas and -ammas. The “oldest male name” is Krishnappa, with an average age of 56. And then you have Krishnamurthy and Narayana, which don’t have the -appa suffix but represent an old population anyway (the other -appa names just don’t clear the 10,000-people cutoff).

More women’s names with the -amma suffix clear the 10,000-people cutoff, and we can see that pretty much all women’s names with an average age of 50 and above have that suffix. The “oldest female name”, subject to 10,000 people having that name, is Muniyamma. And then you have Sarojamma and Jayamma and Lakshmamma. And a lot of other ammas.

What are the oldest and youngest names if we relax the popularity cutoff and instead look at names held by at least 1,000 people? The five youngest names are Dhanush, Prajwal, Harshitha, Tejas and Rakshitha, all with an average age (among voters) of less than 24. The five oldest names are Papamma, Kannamma, Munivenkatappa, Seethamma and Ramaiah.

This should give another indication of where names are headed in Bangalore!

Single Malt Recommendation App

Life is too short to drink whisky you don’t like.

How often have you found yourself in a duty free shop in an airport, wondering which whisky to take back home? Unless you are a pro at this already, you might want something you haven’t tried before, but don’t want to end up buying something you may not like. The names are all grand, as Scottish names usually are. The region might offer some clue, but not so much.

So I started on this work a few years back, when I first discovered this whisky database. I had come up with a set of tables to recommend what whisky is similar to what, and which single malts are the “most unique”. Based on this, I discovered that I might like Ardbeg. And I ended up absolutely loving it.

And ever since, I’ve carried a couple of tables in my Evernote to make sure I have some recommendations handy when I’m at a whisky shop and need to make a decision. But the tables are not user-friendly, and don’t directly tell you what you should buy, what your next choice should be, and so on.

To make things more user-friendly, I have built this app where all you need to enter is your favourite set of single malts, and it gives you a list of other single malts that you might like.

The data set is the same. Once again I use cosine similarity to measure how similar different whiskies are. Except that this time I take the average of the flavour profiles of your favourite whiskies, and then look for the whiskies closest to that average.
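For the curious, here is a stripped-down sketch of the logic (not the app’s actual code), assuming the flavour profiles are in a numeric matrix with one row per whisky and one column per flavour attribute:

```r
# Stripped-down sketch of the recommendation logic, not the app's actual
# code. Assumes `profiles` is a numeric matrix with one row per whisky
# (row names are whisky names) and one column per flavour attribute.

recommend_whiskies <- function(profiles, favourites, n = 10) {
  # Average flavour profile of the whiskies you already like
  target <- colMeans(profiles[favourites, , drop = FALSE])

  # Cosine similarity of every whisky to that average profile
  cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
  sims <- apply(profiles, 1, cosine_sim, b = target)

  # Drop what you've already listed, and return the closest matches
  sims <- sims[!names(sims) %in% favourites]
  head(sort(sims, decreasing = TRUE), n)
}

# Example usage (the favourites here are just for illustration):
# recommend_whiskies(profiles, c("Ardbeg", "Lagavulin"))
```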

In terms of technologies, I’ve used an R package called Shiny to build the app. It took no more than half an hour of programming effort, and most of that went into building the logic rather than the UI.

So take it for a spin, and let me know what you think.


Book challenge update

At the beginning of this year, I took a break from Twitter (which lasted three months), and set myself a target to read at least 50 books during the calendar year. As things stand now, the number stands at 28, and it’s unlikely that I’ll hit my target, unless I count Berry’s story books in the list.

While I’m not particularly worried about my target, what I am worried about is that the target has made me see books differently. For example, I’m now less likely to abandon books midway – the sunk cost fallacy means that I try harder to finish them so that I can add to my annual count. Sometimes I literally flip through the pages of a book looking for interesting things, in an attempt to finish it one way or the other (I did this for Ray Dalio’s Principles and Randall Munroe’s What If, both of which I gave low ratings).

Then, the target being in terms of number of books per year means that I get annoyed with long books. Like, it’s been nearly a month since I started Jonathan Wilson’s Angels with Dirty Faces, but I’m still barely 30% of the way through – a figure I know because I’m reading it on my Kindle.

Even worse are large books that I struggle to finish. I spent about a month on Bill Bryson’s At Home, but found it too verbose and badly written, and so I gave it up halfway through. I don’t know if I should count it towards my reading challenge. It’s a similar story with Siddhartha Mukherjee’s The Emperor of All Maladies – this morning, I put it down for maybe the fourth time (I bought it when it was first published) after failing to make progress. It’s simply too dry for someone not passionate about the subject.

Oh, and this has been the big insight from this reading challenge – that I read significantly faster on the Kindle than I do with physical books. Firstly, it’s easier to carry around. Secondly, I can read in the dark, since I got myself a Kindle Paperwhite last year. One of the times I read from my Kindle is in the evening when I’m putting Berry to sleep, which means I need to read in the dark with a device that doesn’t produce too much light. Then, the ability to control the font size and the easy page turns mean that I progress much faster – even when I stop to highlight and make notes (a feature I miss dearly when reading physical books; searchable notes are a game changer).

I also find that when I’m reading on the Kindle, it’s easier to “put fight” to get through a book that is difficult to read but insightful. That’s how I managed to get through Diana Eck’s India: A Sacred Geography, and that’s the reason I made it a point to buy Jordan Peterson’s book on Kindle – I knew it would be a tough read and that I would never get through it if I were reading the physical version.

Finally, the time taken to finish a book follows a bimodal distribution. I either finish the book in a day or two, or I take a month to finish it. For example, I went to Copenhagen for a holiday in August and found a copy of Michael Lewis’s The Big Short in my Airbnb. I was there for three days, and finished it off in that time. On the other hand, 12 Rules for Life took over a month.

Taking your audience through your graphics

A few weeks back, I got involved in a Twitter flamewar with Shamika Ravi, a member of the Indian Prime Minister’s Economic Advisory Council. The object of the argument was a set of gifs she had released to show different aspects of the Indian economy. Admittedly I started the flamewar. Guilty as charged.

Thinking about it now, this wasn’t the first time I was complaining about her gifs – I began my now popular (at least on Twitter) Bad Visualisations Tumblr with one of her gifs.

So why am I so opposed to animated charts like the one in the link above? It is because they demand too much of the consumer’s attention, and it is hard to get information out of them. If you notice something interesting, by the time you have digested that information the graphic has moved several frames forward.

Animated charts became a thing about a decade ago, following the late Hans Rosling’s legendary TED Talk. In that lecture, Rosling used “motion charts” (a concept he possibly invented) – essentially sets of bubbles moving around a chart – as he sought to explain how the condition of the world has improved significantly over the years.

It is a brilliant talk – a very interesting set of statistics, simply presented, with Rosling taking the viewers through them. And that last part is the most important: these motion charts work for Rosling because he talks to the audience as the charts play out. He pauses when there is an explanation to be made or when the charts are at a key moment. He explains some counterintuitive data points that the charts throw up.

And this is precisely how animated visualisations need to be done, and where they work – as part of a live presentation where a speaker is talking along with the charts and using them as visual aids. Take Rosling (or any other skilled speaker) away from the motion charts, though, and you will see them fall flat – without knowing what the key moments in the chart are, and without the right kind of annotations, the readers are lost and don’t know what to look for.

There are a large number of aids to speaking that can occasionally double up as aids to writing. Graphics and charts are one example. Powerpoint (or Keynote or Slides) presentations are another. And the important thing with these visual aids is that the way they work as an aid is very different from the way they work standalone. And the makers need to appreciate the difference.

In business school, we were taught to follow the 5 by 5 formula (or some such thing) while making slides – that a slide should have no more than five bullet points, and each point should have no more than five words. This worked great in school as most presentations we made accompanied our talks.

Once I started working (for a management consultancy), though, I realised this didn’t work there because we used powerpoint presentations as standalone written communications. Consequently, the amount of information on each slide had to be much greater, else the reader would fail to get any information out of it.

Conversely, a powerpoint presentation meant as a standalone document would fail spectacularly when used to accompany a talk, for there would be too much information on each slide, and massive redundancy between what is on the slide and what the speaker is saying.

The same classification applies to graphics as well. Interactive and animated graphics do brilliantly as part of speeches, since the speaker can control what the audience is seeing and make sure the right message gets across. As part of “print” (graphics shared standalone, like on Twitter), though, these graphics fall flat, since readers struggle to get information out of them.

Similarly, a dense, well-annotated graphic that might do well in print can fail when used as a visual aid, since there will be too much information and the audience will not be able to focus on either the speaker or the graphic.

It is all about the context.

Analytics for general managers

While good managers have always been required to be analytical, the level of analytical ability being asked of managers has been going up over the years, with the increase in availability of data.

Now, this post is once again based on that one single and familiar data point – my wife. In fact, if you want me to include more data in my posts, you should talk to me more.

Leaving that aside, my wife works as a mid-level manager for an extremely large global firm. She was recruited straight out of business school for a “MBA track” program. And from our discussions about her work in the first few months, one thing she did lots of was writing SQL queries. And she still spends a lot of her time writing queries and building Excel models.

This isn’t something she was trained for, or was tested on while being recruited. She did her MBA at a famously diverse global business school, where the diversity of the student body meant that the level of maths and quantitative methods was kept rather low. She was recruited as a “general manager”. Yet, in a famously data-driven company, she spends a considerable amount of time on quantitative stuff.

It wasn’t always like this. While analytical ability is what (in my opinion) set apart graduates of elite MBA programs from those of middling MBA programs, the level of quantitative ability expected of MBAs (apart from maybe those in finance) wasn’t too high. You were expected to know how to use spreadsheets. You were expected to know some rudimentary statistics – means and standard deviations and maybe some basic hypothesis testing. And you were expected to be able to make managerial decisions based on numbers. That’s about it.

Over the years, though, as the corpus of data within (and outside) organisations has grown, and making decisions based on data has become fashionable (a brilliant thing as far as I’m concerned), the requirement from managers has grown as well. Now they are expected to do more with data, and aren’t always trained for that.

Some organisations have responded to this problem by supplying “data analysts” who are attached to mid level managers, so that the latter can outsource the analytical work to the former and spend most of their time on “managerial” stuff. The problem with this is twofold – it is hard to guarantee a good career path to this data analyst (which makes recruitment hard), and this introduces “friction” – the manager needs to tell the analyst what precise data and analysis she needs, and iterating on this can lead to a lot of time lost.

Moreover, as the size of the data has grown, the complexity of the analysis that can be done and the insights that can be produced has become greater as well. And in that sense, managers who have been able to adapt to the volume and complexity of data have a significant competitive advantage over their peers who are less comfortable with data.

So what does all this mean for general managers and their education? First, I would expect the smarter managers to recognise that data analysis ability is a competitive advantage, and to invest time in building that skill. Second, I know of some business schools that are making their MBA programs less quantitative as their student bodies become more diverse and their recruiter pools become less diverse (banks are recruiting far less nowadays). This is a bad move. In fact, business schools need to realise that a quantitative MBA program is more of a competitive advantage nowadays, and tune their programs accordingly, without compromising on the diversity of the student intake.

Then, there is a generation of managers that got along quite well without getting its hands dirty with data. These managers will now get challenged by younger managers who are more conversant with data. It will be interesting to see how organisations deal with this dynamic.

Finally, organisations need to invest in training programs, to make sure that their general managers are comfortable with data, and analysis, and making use of internal and external data science resources. Interestingly enough (I promise I hadn’t thought of this when I started writing this post), my company offers precisely one such workshop. Get in touch if you’re interested!