More on interactive graphics

So for a while now I’ve been building this cricket visualisation thingy. Basically it’s what I think is a pseudo-innovative way of describing a cricket match, by showing how the game ebbs and flows, and marking off the key events.

Here’s a sample, from the ongoing game between Chennai Super Kings and Kolkata Knight Riders.

As you might appreciate, this is a bit cluttered. One “brilliant” idea I had to declutter this was to create an interactive version, using Plotly and D3.js. It’s the same graphic, but instead of all those annotations appearing, they’ll appear when you hover on those boxes (the boxes are still there). Also, when you hover over the line you can see the score and what happened on that ball.

When I came up with this version two weeks back, I sent it to a few friends. Nobody responded. I checked back with them a few days later. Nobody had seen it. They’d all opened it on their mobile devices, and interactive graphics are ill-defined for mobile!

Because on mobile there’s no concept of “hover”. Even “click” is badly defined because fingers are much fatter than mouse pointers.

And nowadays everyone uses mobile – even in corporate settings. People who spend most time in meetings only have access to their phones while in there, and consume all their information through that.

Yet, you have visualisation “experts” who insist on the joys of tools such as Tableau, or other things that produce nice-looking interactive graphics. People go ga-ga over motion charts (they’re slightly better in that they can communicate more without input from the user).

In my opinion, the lack of use on mobile is the last nail in the coffin of interactive graphics. It is not like they didn’t have their problems already – the biggest problem for me is that it takes too much effort on the part of the user to understand the message that is being sent out. Interactive graphics are also harder to do well, since the users might use them in ways not intended – hovering and clicking on the “wrong” places, making it harder to communicate the message you want to communicate.

As a visualiser, one thing I’m particular about is being in control of the message. As a rule, a good visualisation contains one overarching message, and the user gets that message as soon as she sees the chart. And in an interactive chart which the user has to control, there is no way for the designer to control the message!

Hopefully this difficulty with seeing interactive charts on mobile will mean that my clients will start demanding them less (at least that’s the direction in which I’ve been educating them all along!). “Controlling the narrative” and “too much work for consumer” might seem like esoteric objections, but “can’t be consumed on mobile” is surely a winning argument!

A banker’s apology

Whenever there is a massive stock market crash, like the one in 1987, or the crisis in 2008, it is common for investment banking quants to talk about how it was a “1 in zillion years” event. This is on account of their models that typically assume that stock prices are lognormal, and that stock price movement is Markovian (today’s movement is uncorrelated with tomorrow’s).

In fact, a cursory look at recent data shows that what the models show to be a one in zillion years event actually happens every few years, or decades. In other words, while quant models do pretty well in the average case, they have thin “tails” – they underestimate the likelihood of extreme events, which allows risk to build up in the system.
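
To put rough numbers on how thin those tails are, here is a minimal sketch. It assumes daily log returns are normal (which is what lognormal prices imply) and that daily volatility is 1%. Both the volatility figure and the function names are my own illustrative assumptions, not anything taken from a real bank's model.

```python
import math


def normal_tail_probability(z: float) -> float:
    """P(Z <= -z) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def expected_wait_years(daily_probability: float, trading_days_per_year: int = 252) -> float:
    """Average number of years between events that occur with the given daily probability."""
    return 1.0 / (daily_probability * trading_days_per_year)


# Assumed, illustrative input: 1% daily volatility, so a 5% fall is a "5-sigma" day.
DAILY_VOL = 0.01

for drop in (0.03, 0.05, 0.10):
    sigmas = drop / DAILY_VOL
    p = normal_tail_probability(sigmas)
    print(f"{drop:.0%} one-day fall: about once every {expected_wait_years(p):.3g} years")
```

Under these assumptions the implied waiting time grows explosively with the size of the fall, which is how a model ends up describing something seen every few decades as a once-in-a-zillion-years event.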

When I decided to end my (brief) career as an investment banking quant in 2011, I wanted to take the methods that I’d learnt into other industries. While “data science” might have become a thing in the intervening years, there is still a lot for conventional industry to learn from banking in terms of using maths for management decision-making. And this makes me believe I’m still in business.

And like my former colleagues in investment banking quant, I’m not immune to the fat tail problem either – replicating solutions from one domain into another can replicate the problems as well.

For a while now I’ve been building what I think is a fairly innovative way to represent a cricket match. Basically you look at how the balance of play shifts as the game goes along. So the representation is a line graph that shows where the balance of play was at different points of time in the game.

This way, you have a visualisation that at one shot tells you how the game “flowed”. Consider, for example, last night’s game between Mumbai Indians and Chennai Super Kings. This is what the game looks like in my representation.

What this shows is that Mumbai Indians got a small advantage midway through the innings (after a short blast by Ishan Kishan), which they held through their innings. The game was steady for about 5 overs of the CSK chase, when some tight overs created pressure that resulted in Suresh Raina getting out.

Soon, Ambati Rayudu and MS Dhoni followed him to the pavilion, and MI were in control, with CSK losing 6 wickets in the course of 10 overs. When they lost Mark Wood in the 17th over, Mumbai Indians were almost surely winners – my system reckoning that 48 to win in 21 balls was near-impossible.

And then Bravo got into the act, putting on 39 in 10 balls with Imran Tahir watching at the other end (including taking 20 off a Mitchell McClenaghan over, and 20 again off a Jasprit Bumrah over at the end of which Bravo got out). And then a one-legged Jadhav came, hobbled for 3 balls and then finished off the game.

Now, while the shape of the above curve is representative of what happened in the game, I think it went too close to the axes. 48 off 21 with 2 wickets in hand is not easy, but it’s not a 1% probability event (as my graph depicts).

And looking into my model, I realise I’ve made the familiar banker’s mistake – of assuming independence and the Markovian property. I calculate the probability of a team winning using a method called “backward induction” (that I’d learnt during my time as an investment banking quant). It’s the same method used by WASP, the odds-evaluation system invented by a few Kiwi scientists, and as I’d pointed out in the past, WASP has the thin tails problem as well.
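
For the curious, here is a toy sketch of what backward induction looks like for a chase. This is not my actual model: the per-ball probabilities below are made-up numbers, and the state is just (balls left, runs needed, wickets left). It is written as a memoised recursion, which evaluates the same recurrence from the end of the innings backwards. Notice that every ball is drawn from the same distribution regardless of what happened before it, which is exactly the independence assumption I'm complaining about.

```python
from functools import lru_cache

# Hypothetical per-ball outcome probabilities (runs scored -> probability).
# These are illustrative assumptions, not estimates from real data.
OUTCOMES = {0: 0.35, 1: 0.35, 2: 0.08, 4: 0.12, 6: 0.05}
P_WICKET = 0.05  # assumed probability that a ball takes a wicket (no runs scored)


@lru_cache(maxsize=None)
def win_probability(balls_left: int, runs_needed: int, wickets_left: int) -> float:
    """Probability that the chasing side wins from the given state."""
    if runs_needed <= 0:
        return 1.0  # target reached
    if balls_left == 0 or wickets_left == 0:
        return 0.0  # out of balls or all out
    # Value of this state = expected value of the states one ball later.
    p = P_WICKET * win_probability(balls_left - 1, runs_needed, wickets_left - 1)
    for runs, prob in OUTCOMES.items():
        p += prob * win_probability(balls_left - 1, runs_needed - runs, wickets_left)
    return p


# The situation discussed above: 48 needed off 21 balls with 2 wickets in hand.
print(win_probability(21, 48, 2))
```

Because the same distribution is used for every ball, a batsman on a hot streak gets no credit for it, so late surges like Bravo's look far less likely in the model than they are on the field.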

As Seamus Hogan, one of the inventors of WASP, had pointed out in a comment on that post, one way of solving this thin tails issue is to control for the pitch or regime, and I’ve incorporated that as well (using a Bayesian system to “learn” the nature of the pitch as the game goes on). Yet I see that I still struggle with fat tails.
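
To give a flavour of what “learning” the pitch can mean, here is about the simplest Bayesian update I can think of. It is a stand-in for illustration and not the system I've actually built: it treats the chance that any given ball goes for a boundary as the unknown property of the pitch, starts from an assumed Beta prior, and updates it as balls are bowled. The prior and the observed counts are made-up numbers.

```python
def update_beta(alpha: float, beta: float, successes: int, trials: int) -> tuple:
    """Conjugate Beta-Binomial update: add successes to alpha, failures to beta."""
    return alpha + successes, beta + (trials - successes)


def beta_mean(alpha: float, beta: float) -> float:
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Assumed prior: roughly 15% of balls go for a boundary, held with the
# weight of about 100 previously seen balls.
prior = (15, 85)

# Hypothetical observation: 9 boundaries in the first 30 balls of the match.
posterior = update_beta(*prior, successes=9, trials=30)

print(f"prior boundary rate:     {beta_mean(*prior):.3f}")
print(f"posterior boundary rate: {beta_mean(*posterior):.3f}")
```

The weight of the prior controls how quickly the estimate reacts: a heavier prior needs more balls before the model accepts that this is a batting paradise.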

I seriously need to find a way to incorporate serial correlation into my models!

That said, I must say I’m fairly kicked about the system I’ve built. Do let me know what you think of this!

When a two-by-two ruins a scatterplot

The BBC has some very good analysis of the Brexit vote (how long back was that?), using voting data at the local authority level, and correlating it with factors such as ethnicity and educational attainment.

In terms of educational attainment, there is a really nice chart, that shows the proportion of voters who voted to leave against the proportion of population in the ward with at least a bachelor’s degree. One look at the graph tells you that the correlation is rather strong:

Source: http://www.bbc.com/news/uk-politics-38762034

And then there is the two-by-two that is superimposed on this – with regions being marked off in pink and grey. The idea of the two-by-two must have been to illustrate the correlation – to show that education is negatively correlated with the “leave” vote.

But what do we see here? A majority of the points lie in the bottom left pink region, suggesting that wards with a lower proportion of graduates were less likely to vote leave. And this is entirely the wrong message for the graph to send.

The two-by-two would have been useful had the points in the graph been neatly divided into clusters that could be arranged in a grid. Here, though, what the scatter plot shows is a nice negatively correlated linear relationship. And by putting those pink and grey boxes, the illustration is taking attention away from that relationship.

Instead, I’d simply put the scatter plot as it is, and maybe add the line of best fit, to emphasise the negative correlation. If I want to be extra geeky, I might also write down the $R^2$ next to the line, to show the extent of correlation!
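
Here is a minimal sketch of how that line and its $R^2$ can be computed with ordinary least squares. The numbers are invented purely for illustration and are not the BBC's data; the function name is mine as well.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


# Made-up ward-level figures (percentages), for illustration only.
graduates_pct = [10, 15, 20, 25, 30, 35, 40]
leave_pct = [70, 66, 60, 57, 50, 46, 40]

slope, intercept, r_squared = fit_line(graduates_pct, leave_pct)
print(f"leave % = {intercept:.1f} + ({slope:.2f}) * graduates %, R^2 = {r_squared:.3f}")
```

A negative slope with an $R^2$ close to 1 says in a single line what the pink and grey boxes fail to say.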