Taking Intelligence For Granted

There was a point in time when the use of artificial intelligence or machine learning or any other kind of intelligence in a product was a source of competitive advantage and differentiation. Nowadays, however, many people have become so spoiled by the intelligence built into the products they use that it has become more of a hygiene factor.

Take this morning’s post, for example. One way to look at it is that Spotify, with its customisation algorithms and recommendations, has spoiled me so much that I find Amazon’s pushing of Indian music irritating (Amazon’s approach can be called “naive customisation”: they push Indian music to me only because I’m based in India, without learning further from my preferences).

Had I not been exposed to the more intelligent customisation that Spotify offers, I might have found Amazon’s naive customisation interesting. However, Spotify’s degree of customisation has spoilt me so much that Amazon is simply inadequate.

This expectation of intelligence goes beyond product and service classes. When we get used to Spotify recommending music we like based on our preferences, we hold Netflix’s recommendation algorithm to a higher standard. We question why the Flipkart homepage is not customised to us based on our previous shopping. Or why Google Maps doesn’t learn that some of us don’t like driving through small roads when we can help it.

That customers take intelligence for granted nowadays means that businesses have to invest more in offering this intelligence. Easy-to-use data analysis and machine learning packages mean that at least some part of any industry uses intelligence in at least some form (even if they might do it badly if they fail to throw human intelligence into the mix!).

So if you are in the business of selling to end customers, keep in mind that they are used to seeing intelligence everywhere around them, and whether they state it or not, they expect it from you.

More on statistics and machine learning

I’m thinking about a client problem right now, where something we need to predict can be modelled as a function of a few other things that we will know.

Initially I was thinking about it from the machine learning perspective, and my thought process went “this can be modelled as a function of X, Y and Z. Once this is modelled, then we can use X, Y and Z to predict this going forward”.

And then a minute later I context switched into the statistical way of thinking. And now my thinking went “I think this can be modelled as a function of X, Y and Z. Let me build a quick model to see what the goodness of fit is, and whether a signal actually exists”.

Now this might reflect my own biases, and my own processes for learning to do statistics and machine learning, but one important difference I find is that in statistics you are concerned about the goodness of fit, and whether there is a “signal” at all.

While in machine learning as well we look at predictive ability (area under the ROC curve and all that), there is a bit of delay in the process between the time we model and the time we look at the goodness of fit. What this means is that sometimes we can get a bit too certain about the models we want to build, without first asking whether they make sense and whether there is a signal there at all.

For example, in the machine learning world, the R Square of a regression hardly matters – the only thing that matters is how well you can predict out of sample. So while you’re building the regression (machine learning) model, you don’t have immediate feedback on what to include, what to exclude and whether there is a signal.
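To make the contrast concrete, here is a minimal R sketch (with hypothetical predictors X, Y and Z, and a hypothetical data frame df): the statistical route fits and immediately checks the goodness of fit and the signal, while the machine learning route holds out data and judges the model purely on out-of-sample prediction.

```r
# Hypothetical data frame df with columns target, X, Y and Z

# Statistical way: fit, then immediately look at goodness of fit and significance
fit <- lm(target ~ X + Y + Z, data = df)
summary(fit)   # R-squared and p-values tell us whether there is a signal at all

# Machine learning way: hold out data, fit, and judge purely on out-of-sample error
train <- sample(nrow(df), 0.8 * nrow(df))
fit_ml <- lm(target ~ X + Y + Z, data = df[train, ])
pred <- predict(fit_ml, newdata = df[-train, ])
sqrt(mean((df$target[-train] - pred)^2))   # RMSE on the held-out 20%
```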

I must remind you that machine learning methods are typically used when we are dealing with really high dimensional data, and where the signal usually exists in the interplay between explanatory variables rather than in a single explanatory variable. Statistics, on the other hand, is used more for low dimensional problems where each variable has reasonable predictive power by itself.

It is possibly a quirk of how the two disciplines are practiced that statistics people are inherently more sceptical about the existence of signal, and machine learning guys are more certain that their model makes sense.

What do you think?

Data, football and astrology

Jonathan Wilson has an amusing article on data and football, and how many data-oriented managers in football have also been incredibly superstitious.

This is in response to BT Sport’s (one of the UK broadcasters of the Premier League) announcement of its “Unscripted” promotion, which brings together “some of the world’s foremost experts in both sports and artificial intelligence to produce a groundbreaking prophecy of the forthcoming season”.

Wilson writes:

I was reminded also of the 1982 film adaptation of Agatha Christie’s 1939 novel Murder is Easy in which a computer scientist played by Bill Bixby enters the details of the case into a programme he has coded to give the name of the murderer. As it turns out, the programmer knows this is nonsense and is merely trying to gauge the reaction of the heroine, played by Lesley-Anne Down, when her name flashes on the screen.

But this, of course, is not what data-based analysis is for. Its predictive element deals in probability not prophecy. It is not possessed of some oracular genius. (That said, it is an intriguing metaphysical question: what if you had all the data, not just ability and fitness, but every detail of players’ diet, relationships and mental state, the angle of blades of grass on the pitch, an assessment of how the breathing of fans affected air flow in the stadium … would the game’s course then be inevitable?)

This reminded me of my own piece from last year about how data science “is simply the new astrology”.

Periodicals and Dashboards

The purpose of a dashboard is to give you a live view of what is happening with the system. Take, for example, the instrument it is named after – the car dashboard. It tells you what the speed of the car is at this very moment, along with other indicators such as which lights are on, the engine temperature, fuel levels, etc.

Not all reports, however, need to be dashboards. Some reports can be periodicals. These periodicals don’t tell you what’s happening at a moment, but give you a view of what happened in or at the end of a certain period. Think, for example, of classic periodicals such as newspapers or magazines, in contrast to online newspapers or magazines.

Periodicals tell you the state of a system at a certain point in time, and also give information of what happened to the system in the preceding time. So the financial daily, for example, tells you what the stock market closed at the previous day, and how the market had moved in the preceding day, month, year, etc.

Doing away with metaphors, business reporting can be classified into periodicals and dashboards. And they work exactly like their metaphorical counterparts. Periodical reports are produced periodically, and tell you what happened over a certain period, or at a certain point of time, in the past. A good example is company financials – the income statement describes what happened over a period, and the balance sheet describes the position at a point in time.

Once a periodical is produced, it is frozen in time for posterity. Another edition will be produced at the end of the next period, but it is a new edition. It adds to the earlier periodical rather than replacing it. Periodicals thus have historical value and because they are preserved they need to be designed more carefully.

Dashboards, on the other hand, are fleeting, and not usually preserved for posterity. They are instead overwritten. So whether all systems are up this minute doesn’t matter a minute later if you haven’t reacted to the report within that minute, and the information ceases to be of importance the next minute (of course there might be some aspects that remain important at a later date, and those will be captured in the next periodical).

When we are designing business reports and other “business intelligence systems” we need to be cognisant of whether we are producing a dashboard or a periodical. The fashion nowadays is to produce everything as a dashboard, perhaps because there are popular dashboarding tools available.

However, dashboards are expensive. For one, they need a constant connection to be maintained to the “system” (database or data warehouse or data lake or whatever other storage unit, in the business reporting sense). Also, by definition they are not stored, and if you do need to store them you have to decide upon a frequency of storage, which makes them periodicals anyway.

So companies can save significantly on resources (compute and storage) by switching from dashboards (which everyone seems to think in terms of) to periodicals. The key here is to get the frequency of the periodical right – too frequent and people will get bugged. Not frequent enough, and people will get bugged again due to lack of information. Given the tools and technologies at hand, we can even make reports “on demand” (for stuff not used by too many people).
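As a minimal sketch of the periodical route (the connection, table and query here are all hypothetical – the point is the pattern, not the specifics): a scheduled script queries the store once, writes out a dated edition, and disconnects, instead of a dashboard holding a connection open all the time.

```r
library(DBI)

# Hypothetical warehouse and table; query once, no standing connection
con <- dbConnect(RSQLite::SQLite(), "warehouse.sqlite")
sales <- dbGetQuery(con, "SELECT region, SUM(amount) AS total
                          FROM sales
                          WHERE sale_date >= date('now', '-7 days')
                          GROUP BY region")
dbDisconnect(con)

# Freeze this edition with its date; next week's run adds a new file rather than overwriting
write.csv(sales, paste0("weekly_sales_", Sys.Date(), ".csv"), row.names = FALSE)
```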

Telling stories with data

I’m about 20% through with The Verdict by Prannoy Roy and Dorab Sopariwala. It’s a fascinating book, except for one annoyance – it is full of tables that serve no purpose but to break the flow of text.

I must mention that I’m reading the book on the Kindle, which means that the tables can pose a major annoyance. Text breaks off midway through one page, and the next couple of pages involve a table or two, with several lines of text explaining what’s in the table. And then the text continues. It makes for a rather disruptive reading experience. And some of the tables have just one data point – making one wonder why it has been inserted there at all.

This is not the first book I’ve noticed making this mistake. Some of the sports analytics books I’ve read in recent times, such as The Numbers Game, make the same error (I read that one in print, and still faced the same disruption). Bhagwati and Panagariya’s Why Growth Matters is similarly unreadable – tables abruptly inserted into the middle of the text, making the reader lose flow.

Telling a data story in book length is a completely different challenge to telling one in article length. And telling a story with data is a complete art form. When you’re putting a table there, you need to be able to explain why that table is important to the story – rather than putting it there just because it seems more rigorous.

Also the exact placement of the table (something that can’t be controlled well in Kindle, but is easy to fix in either HTML or print) matters –  the table should be relevant to the piece of text immediately preceding and succeeding it, in a way that it doesn’t disrupt the reader’s flow. More importantly, the table should be able to add value at that particular point – perhaps building on something that has been described in the previous paragraph.

Book length makes it harder because people don’t normally expect tables and figures to disturb their reading flow when reading something of book length. Also, the book format means that it is not always possible to insert a table at a precise point (even in print, where pagination is an issue).

So how do you tell a book length story with data? Firstly, be very stingy about the data that you want to show – anything that doesn’t immediately add value should be banished to the appendix. Even the rigour, which academics might be particular about, can be pushed to the end notes (not footnotes, since those can be disruptive to flow as well, turning pages into half pages).

Then, once you know that showing a particular table or graph is inevitable to telling the story, put it either in the beginning or the end of a chapter. This way, it doesn’t break the reader’s flow. Then, refer to individual numbers in the middle of the text without having to put the entire table in there. Unless each and every data point in the table is important, banish it to the endnotes.

One other common mistake (I made it in my piece in Forbes published yesterday) is to put in a big table and not talk about it. That only confuses the reader, who keeps looking in the later parts of the text for explanations of everything in the table.

I guess authors and analysts tend to get possessive. If you have worked hard to produce insights from data, you want to share as much of it as possible. And this can mean simply dumping all the data into the piece without regard for what the reader will do with it.

I’m making a note to myself to not repeat this mistake in future.

The problem with spider charts

On FiveThirtyEight, Nate Silver has a piece looking ahead to the Democratic primaries ahead of the presidential elections in the US next year. I don’t know enough about US politics to comment on the piece itself, but what caught my eye is the spider chart describing the various Democratic nominees.

This is a standard spider chart that people who read business news should recognise, so the appearance of such a chart isn’t big news. What bothers me, though, is that a respected data journalist like Nate Silver is publishing such charts, especially in an article under his own name. For spider charts do a lousy job of conveying information.

Implicitly, you might think that the area of the pentagon (in this case) thus formed conveys the strength of a particular candidate. Leaving aside the fact that the human eye can judge areas less well than lengths, the area of a spider chart accurately shows “strength” only in one corner case – where the values along all five axes are the same.

In all other cases, such as in the spider charts above, the area of the pentagon (or whatever-gon) thus formed depends on the order in which the factors are placed. For example, in this chart, why should black voters be placed between Asian/Hispanic voters and millennials? Why should party loyalists lie between Asian/Hispanic voters and the left?

I may not have that much insight into US politics, but it should be fairly clear that the ordering of the factors in this case has no particular sanctity. You should be able to jumble up the order of the axes and the information in the chart should remain the same.

The spider chart doesn’t work this way. If the lengths of the “semidiagonals” (the five axes on which we are measuring) are l_1, l_2, \ldots, l_n, the area of the polygon thus formed equals \frac{1}{2} \sin\left(\frac{360^\circ}{n}\right) (l_1 l_2 + l_2 l_3 + \cdots + l_n l_1). It is not hard to see that for any value of n \ge 4, the ordering of the “axes” makes a material difference in the area of the chart.
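A quick R sketch (with made-up numbers) shows how much the ordering matters – the same five values produce different areas depending purely on the order in which they are placed around the chart:

```r
# Area of a spider/radar polygon, given the axis lengths in the order they are drawn
spider_area <- function(l) {
  n <- length(l)
  0.5 * sin(2 * pi / n) * sum(l * c(l[-1], l[1]))   # = 1/2 * sin(360/n degrees) * (l1*l2 + l2*l3 + ... + ln*l1)
}

values <- c(5, 1, 4, 2, 3)    # made-up scores on five axes
spider_area(values)           # one ordering of the axes
spider_area(sort(values))     # same data, different ordering, different area
```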

Moreover, in this particular case, with the legend being shown only with one politician, you need to keep looking back and forth to analyse where a particular candidate lies in terms of support among the five big democrat bases. Also, the representation suggests that these five bases have equal strength in the Democrat support base, while the reality may be far from it (again I don’t have domain knowledge).

Spider charts can look pretty, which might make them attractive to graphic designers. They are just not very good at conveying information.

PS: for this particular data set, I would just go with bars with small multiples (call me boring if you must). One set of bar graphs for each candidate, with consistent colour coding and ordering among the bars so that candidates can be compared easily.
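For what it’s worth, here is a rough ggplot2 sketch of that small-multiples alternative, with hypothetical candidates and made-up strength numbers standing in for the actual FiveThirtyEight data:

```r
library(ggplot2)

# Hypothetical data: one row per candidate per base, "strength" on a 0-5 scale
support <- expand.grid(
  candidate = c("Candidate A", "Candidate B", "Candidate C"),
  base = c("Party loyalists", "The left", "Millennials", "Black voters", "Hispanic/Asian voters")
)
support$strength <- sample(0:5, nrow(support), replace = TRUE)

ggplot(support, aes(x = base, y = strength)) +
  geom_col() +
  coord_flip() +              # horizontal bars keep the base labels readable
  facet_wrap(~ candidate)     # one small panel per candidate, same ordering throughout
```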

Football Elo Application

This morning, I discovered the Club Elo Ratings, and promptly proceeded to analyse Liverpool FC’s performance over the years based on these ratings, and then correlated the performance by manager.

Then, playing around with the data of different clubs, I realised that there are plenty more stories to be told using this data, and they are best told by people who are passionate about their respective clubs. So the best thing I could do is to put the data out there (in a form similar to what I did for Liverpool), so that people can analyse how their clubs have performed over the years, and under different managers.

Sitting beside me as I was doing this analysis, my wife popped in with a pertinent observation. Now, she doesn’t watch football. She hates it that I watch so much football. Nevertheless, she has a strong eye for metrics. And watching me analyse club performance by manager, she asked me if I can analyse manager performance by club!

And so I’ve added that as well to the Shiny app that I’ve built. It might look a bit clunky, with two seemingly unrelated graphs, one on top of the other, but since the two are strongly related, it makes sense to have both in the same app. The managers listed in the bottom dropdown are those who have managed at least two clubs in the Premier League.

If you’re interested in Premier League football, you should definitely check out the app. I think there are some interesting insights to be gleaned (such as what I presented in this morning’s post).

Built by Shanks

This morning, I found this tweet by John Burn-Murdoch, a statistician at the Financial Times, about a graphic he had made for a Simon Kuper (of Soccernomics fame) piece on Jose Mourinho.

Burn-Murdoch also helpfully shared the code he had written to produce this graphic, through which I discovered ClubElo, a website that produces chess-style Elo ratings for football clubs. They have a free and open API, through which Burn-Murdoch got the data for the above graphic, and which I used to download all-time Elo ratings for all clubs available (I can be greedy that way).
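If you want to pull the same data yourself, the download is a one-liner in R – though note that the endpoint format and column names below are as I recall them from the ClubElo API, so do check against their documentation:

```r
# ClubElo serves one CSV per club; endpoint and column names as I recall them
liverpool <- read.csv("http://api.clubelo.com/Liverpool", stringsAsFactors = FALSE)
liverpool$From <- as.Date(liverpool$From)
liverpool$To <- as.Date(liverpool$To)
head(liverpool[, c("Club", "Elo", "From", "To")])
```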

So the first order of business was to see how Liverpool’s rating has moved over time. The initial graph looked interesting, but not interesting enough, so I decided to overlay it with the periods of different managerial regimes (the latter data I got from Wikipedia). And this is what the all-time Elo rating of Liverpool looks like.

It is easy to see that the biggest improvement in the club’s performance came under the long reign of Bill Shankly (no surprises there), who took them from the Second Division to winning the old First Division. There was a brief dip when Shankly retired and his assistant Bob Paisley took over (might this be the time when Paisley got intimidated by Shankly’s frequent visits to the club, and then asked him not to come any more?), but Paisley consolidated on Shankly’s improvement to lead the club to its first three European Cups.

Around 2010, when the club was owned by Americans Tom Hicks and George Gillett and on a decline in terms of performance, this banner became popular at Anfield.

The Yanks were subsequently yanked following a protracted court battle, to be replaced by another Yank (John W Henry), under whose ownership the club has done much better. What is also interesting from the above graph is the managerial change decisions.

At the time, Kenny Dalglish’s sacking at the end of the 2011-12 season (which ended with Liverpool losing the FA Cup final to Chelsea) seemed unfair, but the Elo rating shows that the club’s rating had fallen below the level at which Dalglish had taken over (initially as caretaker). Then there was a steep ascent under Brendan Rodgers (leading to a second-place finish in 2013-14), after which Suarez bit, got sold, and the team went into deep decline.

Again, we can see that Rodgers got sacked when the team had reverted to the rating that he had started off with. That’s when Jurgen Klopp came in, and thankfully so far there has been a much longer period of ascendance (which will hopefully continue). It is interesting to see, though, that the club’s current rating is still nowhere near the peak reached under Rafa Benitez (in the 2008-9 title challenge).

Impressed by the story that Elo Ratings could tell, I got data on all Premier League managers, and decided to repeat the analysis for all clubs. Here is what the analysis for the so-called “top 6” clubs returns:

We see, for example, that Chelsea’s ascendancy started not with Mourinho’s first term as manager, but towards the end of Ranieri’s term – when Roman Abramovich had made his investment. We find that Jose Mourinho actually made up for the decline under David Moyes and Louis van Gaal, and then started losing it. In that sense, Manchester United have got their sacking timing right (though they were already in decline by the time they finished last season in second place).

Manchester City also seem to have done pretty well in terms of the timing of managerial changes. And Spurs’s belief in Mauricio Pochettino, who started off badly, seems to have paid off.

I wonder why Elo Ratings haven’t made more impact in sports other than chess!

Just Plot It

One of my favourite work stories is from this job I did a long time ago. The task given to me was demand forecasting, and the variable I needed to forecast was so “micro” (this intersection that intersection the other) that forecasting was an absolute nightmare.

A side effect of this has been that I find it impossible to believe that it’s possible to forecast anything at all. Several (reasonably successful) forecasting assignments later, I still dread it when the client tells me that the project in question involves forecasting.

Another side effect is that the utter failure of standard textbook methods in that monster forecasting exercise all those years ago means that I find it impossible to believe that textbook methods work with “real life data”. Textbooks and college assignments are filled with problems that when “twisted” in a particular way easily unravel, like a well-tied tie knot. Industry data and problems are never as clean, and elegance doesn’t always work.

Anyway, coming back to the problem at hand, I had struggled for several months with this monster forecasting problem. Most of this time, I had been using one programming language that everyone else in the company used. The code was simultaneously being applied to lots of different sub-problems, so through the months of struggle I had never bothered to really “look at” the data.

I must have told this story before, when I spoke about why “data scientists” should learn MS Excel. For what I did next was to load the data onto a spreadsheet and start looking at it. And “looking at it” involved graphing it. And the solution, or the lack of it, lay right before my eyes. The data was so damn random that it was a wonder that anything had been forecast at all.

It was also a wonder that the people who had built the larger model (into which my forecasting piece was to plug in) had assumed that this data would be forecast-able at all (I mentioned this to the people who had built the model, and we’ll leave that story for another occasion).

In any case, looking at the data, by putting it in a visualisation, completely changed my perspective on how the problem needed to be tackled. And this has been a learning I haven’t let go of since – the first thing I do when presented with data is to graph it out, and visually inspect it. Any statistics (and any forecasting for sure) comes after that.

Yet, I find that a lot of people simply fail to appreciate the benefits of graphing. That it is not intuitive to do with most programming languages doesn’t help. Incredibly, even Python, a favoured tool of a lot of “data scientists”, doesn’t make graphing easy. Last year when I was forced to use it, I found that it was virtually impossible to create a PDF with lots of graphs – something that I do as a matter of routine when working on R (I subsequently figured out a (rather inelegant) hack the next time I was forced to use Python).
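One way to do this in R is the base pdf() device – a minimal sketch, with the built-in mtcars data standing in for real data:

```r
# Every plot drawn between pdf() and dev.off() becomes a new page in the file
pdf("all_graphs.pdf", width = 8, height = 6)
for (column in names(mtcars)) {
  hist(mtcars[[column]], main = column, xlab = column)   # one page per variable
}
dev.off()
```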

Maybe when you work on data that doesn’t have meaningful variables – such as images, for example – graphing doesn’t help (since a variable on its own has little information). But when the data remotely has some meaning – sales or production or clicks or words – graphing can be of immense help, and can give you massive insight into how to develop your model!

So go ahead, and plot it. And I won’t mind if you fail to thank me later!

Bangalore names are getting shorter

The Bangalore Names Dataset, derived from the Bangalore Voter Rolls (cleaned version here), validates a hypothesis that a lot of people had – that given names in Bangalore are becoming shorter. From an average of 9 letters in the name for a male aged around 80, the length of the name comes down to about 6.5 letters for a 20-year-old male.

What is interesting from the graph (click through for a larger version) is the difference in the lengths of male and female names – notice the crossover around age 25 or so. At some point, men’s names continue to become shorter while the lengths of women’s names stagnate.
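A rough sketch of the computation behind such a graph, assuming a cleaned voter-roll data frame called voters with (hypothetical) columns name, age and sex:

```r
library(dplyr)
library(ggplot2)

# voters: hypothetical cleaned data frame with columns name, age and sex
# voters <- read.csv("bangalore_voter_rolls_cleaned.csv")   # hypothetical file name

name_lengths <- voters %>%
  mutate(name_length = nchar(name)) %>%
  group_by(age, sex) %>%
  summarise(mean_length = mean(name_length), .groups = "drop")

ggplot(name_lengths, aes(x = age, y = mean_length, colour = sex)) +
  geom_line() +
  labs(x = "Age", y = "Average length of given name (letters)")
```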

So how are names becoming shorter? For one, honorific endings such as -appa, -amma, -anna, -aiah and -akka are becoming increasingly less common. Someone named “Krishnappa” (the most common name with the ‘appa’ suffix) in Bangalore is on average 56 years old, while someone named Krishna (the same name without the suffix) is on average only 44 years old. Similarly, the average age of people named Lakshmamma is 55, while that of everyone named Lakshmi is just 40.

In fact, if we look at the top 12 male and female names with an honorific ending, the average age of the version without the ending is lower than that of the version with the ending. I’ve even graphed some of the distributions to illustrate this.

In each case, the red line shows the distribution of the longer version of the name, and the blue line the distribution of the shorter version.

In one of the posts yesterday, we looked at the most typical names by age in Bangalore. What happens when we flip the question? Can we define what the “oldest” and “youngest” names are, based on the average age of the people who hold each name? In order to rule out fads, let’s stick to names that are held by at least 10000 people each.

These graphs are candidates for my own Bad Visualisations Tumblr, but I couldn’t think of a better way to represent the data. These graphs show the most popular male and female names, with the average age of a voter with that given name on the X axis, and the number of voters with that name on the Y axis. The information is all in the X axis – the Y axis is there just so that names don’t overlap.

So Karthik is among the youngest names among men, with an average age among voters being about 28 (remember this is not the average age of all Karthiks in Bangalore – those aged below 18 or otherwise not eligible to vote have been excluded). On the women’s side, Divya, Pavithra and Ramya are among the “youngest names”.

At the other end, you can see all the -appas and -ammas. The “oldest male name” is Krishnappa, with an average age 56. And then you have Krishnamurthy and Narayana, which don’t have the -appa suffix but represent an old population anyway (the other -appa names just don’t clear the 10000 people cutoff).

More women’s names with the -amma suffix clear the 10000 names cutoff, and we can see that pretty much all women’s names with an average age of 50 and above have that suffix. And the “oldest female name”, subject to 10000 people having that name, is Muniyamma. And then you have Sarojamma and Jayamma and Lakshmamma. And a lot of other ammas.

What will be the oldest and youngest names if we relax the popularity cutoff, and instead look at names with at least 1000 people? The five youngest names are Dhanush, Prajwal, Harshitha, Tejas and Rakshitha, all with an average age (among voters) of less than 24. The five oldest names are Papamma, Kannamma, Munivenkatappa, Seethamma and Ramaiah.

This should give another indication of where names are headed in Bangalore!