Telling stories with data

I’m about 20% through with The Verdict by Prannoy Roy and Dorab Sopariwala. It’s a fascinating book, except for one annoyance – it is full of tables that serve no purpose but to break the flow of text.

I must mention that I’m reading the book on the Kindle, which means that the tables can pose a major annoyance. Text breaks off midway through one page, and the next couple of pages involve a table or two, with several lines of text explaining what’s in the table. And then the text continues. It makes for a rather disruptive reading experience. And some of the tables have just one data point – making one wonder why it has been inserted there at all.

This is not the first book in which I’ve noticed this mistake. Some of the sports analytics books I’ve read in recent times, such as The Numbers Game, make the same error (I read that one in print, and still faced the same disruption). Bhagwati and Panagariya’s Why Growth Matters is similarly unreadable – tables abruptly inserted into the middle of the text, causing the reader to lose the flow of reading.

Telling a data story at book length is a completely different challenge from telling one at article length. And telling a story with data is an art form in itself. When you put a table there, you need to be able to explain why that table is important to the story – rather than putting it there just because it seems more rigorous.

Also, the exact placement of the table (something that can’t be controlled well on the Kindle, but is easy to fix in either HTML or print) matters – the table should be relevant to the text immediately preceding and succeeding it, so that it doesn’t disrupt the reader’s flow. More importantly, the table should add value at that particular point – perhaps building on something described in the previous paragraph.

Book length makes it harder because people don’t normally expect tables and figures to disturb their reading flow when reading something of book length. Also, the book format means that it is not always possible to insert a table at a precise point (even in print, where pagination is an issue).

So how do you tell a book length story with data? Firstly, be very stingy about the data that you want to show – anything that doesn’t immediately add value should be banished to the appendix. Even the rigour, which academics might be particular about, can be pushed to the end notes (not footnotes, since those can be disruptive to flow as well, turning pages into half pages).

Then, once you know that showing a particular table or graph is essential to telling the story, put it either at the beginning or the end of a chapter. This way, it doesn’t break the reader’s flow. Then, refer to individual numbers in the middle of the text without having to put the entire table in there. Unless each and every data point in the table is important, banish it to the endnotes.

One other common mistake (I did it in my piece in Forbes published yesterday) is to put a big table and not talk about it. It only seeks to confuse the reader, who starts looking for explanations for everything in the table in later parts.

I guess authors and analysts tend to get possessive. If you have worked hard to produce insights from data, you seek to share as much of it as possible. And this can mean simply dumping all the data in the piece without regard for what the reader will do with it.

I’m making a note to myself to not repeat this mistake in future.

Vlogging!

The first seed was sown in my head by Harish “the Psycho” J, who told me a few months back that nobody reads blogs any more, and I should start making “analytics videos” to increase my reach and hopefully hit a new kind of audience with my work.

While the idea was great, I wasn’t sure for a long time what videos I could make. After all, I’m not the most technical guy around, and I had no patience for making videos on “how to use regression” and stuff like that. I needed a topic that would be both potentially catchy and something where I could add value. So the idea remained an idea.

For the last four or five years, my most common lunchtime activity has been to watch chess videos. I subscribe to the YouTube channels of Daniel King and Agadmator, and most days when I eat lunch alone at home are spent watching their analyses of games. Usually this routine gets disrupted on Fridays, when the wife works from home (she positively hates these videos), but one Friday a couple of months back I decided to ignore her and watch the videos anyway (she was in her room working).

She had come out to get herself another serving of whatever she had made that day and saw me watching the videos. And she suddenly asked why I couldn’t make such videos as well. She has seen me work over the last seven years to build what I think is a fairly cool cricket visualisation, and said that I should use it to make little videos analysing cricket matches.

And since then my constant “background process” has been to prepare for these videos. Earlier, Stephen Rushe of Cricsheet used to unfailingly upload ball by ball data of all cricket matches as soon as they were done. However, two years back he went into “maintenance mode” and has stopped updating the data. And so I needed a method to get data as well.

Here, I must acknowledge the contributions of Joe Harris of White Ball Analytics, who not only showed me the APIs to get ball by ball data of cricket matches, but also gave very helpful inputs on how to make the visualisation more intuitive, and palatable to the normal cricket fan who hasn’t seen such a thing before. Joe has his own win probability model based on ball by ball data, which I think is possibly superior to mine in a lot of scenarios (my model does badly in high-scoring run chases), though I’ve continued to use my own model.

So finally the data is ready, I have a much improved visualisation compared to what I had during the IPL last year, and I’ve created what I think is a nice app using the Shiny package that you can check out for yourself here. This covers all T20 international games, and you can use the app to see the “story of each game”.

And this is where the vlogging comes in – in order to explain how the model works and how to use it, I’ve created a short video. You can watch it here:

While I still have a long way to go in terms of my delivery, you can see that the video has come out rather well. There are no sync issues, and you see my face in one corner as well. This was possible thanks to my school friend Sunil Kowlgi’s Outklip app. It’s a pretty easy-to-use Chrome app, and the videos are immediately available on the platform. There is quick YouTube integration as well, for you to upload them.

And this is not a one-time effort – going forward I’ll be making videos of limited overs games, analysing them using my app, and posting them on my YouTube channel (or maybe I’ll make a new channel for these videos – I’ll keep you updated). I hope to become a regular vlogger!

So in the meantime, watch the above video. And give my app a spin. Soon I’ll be releasing versions covering One Day Internationals and franchise T20s as well.

 

Pertinent Observations Grows Up

Over the weekend, I read Ben Blatt’s Nabokov’s Favourite Word Is Mauve, a simple natural language processing based analysis of hundreds of popular authors and their books. In it, Blatt uses several measures of the goodness or badness of writing, and then measures different authors by them.

So he finds, for example, that Danielle Steel opens a lot of her books by talking about the weather, or that Charles Dickens uses a lot of “anaphora” (anyone who remembers the opening of A Tale of Two Cities shouldn’t be surprised by that). He also talks about the use of simple word counts to detect authorship of unknown documents (a separate post will come on that soon).

As someone who has already written a book (albeit nonfiction), I found a lot of this rather interesting, and constantly found myself trying to evaluate my own writing on the metrics to which Blatt subjected the famous authors. And one metric I found especially interesting was the “Flesch-Kincaid grade level”, which is a measure of the complexity of language in a work.

It is a fairly simple formula, based on a linear combination of the average number of words per sentence and the average number of syllables per word. The formula goes like this:

Grade Level = 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59

And the result of the formula tells the approximate school grade of a reader who will be able to understand your writing. As you see, it is not a complex formula, and the shorter your sentences and shorter your words (measured in syllables), the simpler your prose is supposed to be.
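To make the arithmetic concrete, here is a small sketch of the formula. The analysis for this post was done in R, but the computation is the same in any language; this Python version is purely illustrative:

```python
def flesch_kincaid_grade(words, sentences, syllables):
    # 0.39 * (average words per sentence)
    # + 11.8 * (average syllables per word) - 15.59
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# 100 words in 10 sentences, 130 syllables in total:
# short sentences and short words give a low grade level.
print(round(flesch_kincaid_grade(100, 10, 130), 2))  # → 3.65
```

Double the average sentence length and the score jumps by several grades, which is exactly the effect measured in the rest of this post.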

The simplest works by this metric mentioned in Blatt’s book are those of Dr. Seuss, such as The Cat in the Hat or Green Eggs and Ham, on account of the exclusive usage of a small set of words in both books (Dr. Seuss wrote the latter as a challenge, not unlike the challenges we would pose each other during “class participation” in business school). These books have a negative grade score, technically indicating that even a nursery kid should be able to read them, but really meaning they’re simply easy to read.

Since the Flesch-Kincaid grade score is based on a simple set of parameters (word count, sentence count and syllable count), it was rather simple for me to implement it on the posts from this blog.

I downloaded an XML export of all posts (I took this dump some two or three weeks ago), and then used R, with the tidytext package, to analyse the posts. Word count was most straightforward, using the str_count function in the stringr package (part of the tidyverse). Sentence count was a bit more complicated – there were no ready algorithms, so I instead just searched for “sentence enders” (., ?, !, etc. – I know the use of . in abbreviations creates problems, but I can live with that).
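For illustration, the two counts can be sketched as follows. I used R’s stringr for the actual analysis; this hypothetical Python version just mirrors the crude “sentence ender” search described above:

```python
import re

def count_words(text):
    # Anything separated by whitespace counts as a word.
    return len(re.findall(r"\S+", text))

def count_sentences(text):
    # Runs of sentence enders (. ? !) count as one sentence break;
    # abbreviations like "Dr." get over-counted, which we live with.
    return max(len(re.findall(r"[.!?]+", text)), 1)

sample = "Hello world. How are you? Fine!"
print(count_words(sample), count_sentences(sample))  # → 6 3
```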

Syllable count was the hardest. Again, there are some packages, but they’re incredibly hard to use. Finally, after much searching, I came across some code that approximates this, and used it.
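The code I ended up using isn’t shown here, but heuristics of this kind typically count runs of vowels and adjust for a trailing silent “e”. A hypothetical sketch of such an approximation:

```python
import re

def count_syllables(word):
    # Approximate syllables as runs of vowels (y included),
    # knocking one off for a likely-silent trailing "e".
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and n > 1:
        n -= 1
    return max(n, 1)

for w in ["cat", "table", "analysis"]:
    print(w, count_syllables(w))  # cat 1, table 2, analysis 4
```

It fails on words like “rhythm” or “queue”, but averaged over thousands of words the error washes out, which is all a readability score needs.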

Now that the technical stuff is done with, let’s get to the content. The word, sentence and syllable counts all flow into the Flesch-Kincaid (FK) score, the approximate class one needs to be in to understand the text. Let’s just plot the FK score for all my blog posts (a total of 2341 of them) against time. I’ve added a regression line for good effect.

The trend is pretty clear. Over time, this blog has become more complicated and harder to read. In fact, drawing this graph slightly differently gives another message. This time, instead of a regression line, I’ve drawn a curve showing the trend.

When I started writing in 2004, I was at a 5th standard level. This increased steadily for the first two years (I gained a lot of my steady readership in this time) to about 8th standard, and plateaued there for a bit. And then again around 2009-10 there was an increase, as my blog got up to the 10th standard level. It’s pretty much stayed there ever since, apart from a tiny bump up at the end of 2014.

I don’t know if this increase in “complexity” of my blog is a good or a bad thing. On the one hand, it shows growing up. On the other, it’s becoming tougher to read, which has probably coincided with a plateauing (or even a drop) in the readership as well.

Let me know what you think – whether you prefer this “grown up style”, or whether you want me to go back to the simpler writing I started off with.

Newsletter!

So after much deliberation and procrastination, I’ve finally started a newsletter. I call it “the art of data science” and the title should be self-explanatory. It’s pure unbridled opinion (the kind of which usually goes on this blog), except that I only write about one topic there.

I intend to have three sections and then a “chart of the edition” (note how cleverly I’ve named this section to avoid giving much away on the frequency of the newsletter!). This edition, though, I ended up putting in too much harikathe (long-winded storytelling), so I restricted myself to two sections before the chart.

I intend to talk a bit each edition about some philosophical aspect of dealing with data (this section got a miss this time), a bit about data analysis methods (I went a bit meta on that this time) and a little bit about programming languages (which I used for bitching a bit).

And since I plan to put out a “chart of the edition”, I need to read newspapers a lot more, since you are much more likely to find gems (in either direction) there than elsewhere. For the first edition, I picked a good graph I’d seen on Twitter, and it’s about Hull City!

Anyway, enough of this meta-harikathe. You can read the first edition of the newsletter here. In case you want to get it in your inbox each week/fortnight/whenever I decide to write it, then subscribe here!

And send feedback (by email, not comments here) on what you think of the newsletter!

Medium stats

So Medium sends me this email:

Congratulations! You are among the top 10% of readers and writers on Medium this year. As a small thank you, we’ve put together some highlights from your 2016.

Now, I hardly use Medium. I’ve maybe written one post there (a long time ago) and read only a little bit (blogs I really like, I’ve put on RSS and read on Feedly). So when Medium tells me that I, who consider myself a light user, am “in the top 10%”, they’re really giving away that the quality of usage on their site is pretty bad.

Sometimes it’s bloody easy to see through flattery! People need to be more careful about what the stats they’re putting out really convey!

 

Bloggers and anti-bloggers

I know this post “dates” me as someone who started blogging back in the peak era of blogging in the mid 2000s. But what the hell! 

I think you can consider yourself to have “made it” as a blogger when a post that you write attracts abuse. Sometimes this abuse could be in public, in the comments section of the blog. At other times, the abuse is in private, when someone meets you or calls you, and abuses you for writing what you wrote.

As long as you’ve been reasonable in your blogging (which the early years of this blog’s predecessor cannot exactly claim), abuse on your comments section is more of an indicator of the thin-skinnedness of the abuser, rather than you crossing lines on what you should write about.

At this point, it is pertinent to introduce the class of people whom I call “anti-bloggers”. Sometimes they might themselves have a blog, but that is not necessary; what is necessary is that they have a “holier than thou” attitude.

Anti-bloggers are people with especially thin skins who are always on the lookout for something to outrage about, and blogs, which allow people to express themselves freely on a public forum without editorial oversight, are a common whipping boy.

This outrage could come in several forms. The thicker-skinned version of this outrage comes from people who abuse you only if they think you’ve abused them on the blog (good bloggers take care to never mention names in a negative manner, so this is usually a case of “kumbLkai kaLLa heglmuTT nODkonDa” – the pumpkin thief looked at his own shoulder – a Kannada proverb meaning something like “every thief has a straw in his beard”).

The thinner-skinned version of anti-bloggers finds it even easier to find things to outrage about. Look at the Bangalore post I’d written ten years back. There was no hint that I’d written about anyone at all, but the post received heaps of abuse, from people who manufactured some kind of entity that the post purportedly offended!

The most annoying anti-bloggers are those that abuse you when you simply pen down an observation that is there for all to see. I won’t take specific examples now, but sometimes the simple act of reporting a fact that is evident to everyone can offend people, for its existence on paper (a website, rather) gives it new-found legitimacy!

This last bit can also help explain the annoyance of some sections of the “mainstream media” with “social media” such as blogs/twitter. The worthies in the mainstream media had established certain unwritten rules by which certain facts/events wouldn’t be put down on paper.

The mention of these events on social media (which is unedited) suddenly gave them legitimacy, which steered the overall narrative away from where it had stood during the mainstream media monopoly, annoying the mainstream media!

One penultimate point – anti-bloggers are the same people who talk about the glories of the days prior to social media (this piece in The Guardian is an especially strong specimen), when people could only read news that was filtered and possibly censored by newspaper editors.

And finally, ever since my credentials as a blogger were established about a decade back, some people have started explicitly mentioning to me when they are saying something “off the record”. And I’ve always respected these conditions!

Commenting on social media

While I’m more off than on in terms of my consumption of social media nowadays, I find myself commenting less and less.

I’ve stopped commenting on blogs because I primarily consume them using an RSS reader (Feedly) on my iPad, and need to click through and use my iPad keyboard to leave comments, a hard exercise. And comments on this blog make me believe that it’s okay to not comment on blogs any more.

On Facebook, I leave the odd comment, but find that most comments add zero value. “Oh, looking so nice” and “nice couple” and things like that might flatter some people, but make absolutely no sense once you start seeing through the flattery.

So the problem on Facebook is “congestion”, where a large number of non-value-adding comments may crowd out the odd comment that actually adds value, so you, as a value-adding commenter, decide to not comment at all.

The problem on LinkedIn is that people use it mostly as a medium to show off (that might be true of all social media, but LinkedIn even more so), and when you leave a comment there, you’re likely to attract a large number of show-offs whom you are least interested in talking to. Again, there’s the Facebook problem here in terms of congestion. There is also the problem that if you leave a comment on LinkedIn, people might think you’re showing off.

Twitter, in that sense, is good in that you can comment and selectively engage with people who reply to your comment (on Facebook, when all replies are in one place, such selective engagement is hard, and you can offend people by ignoring them). You can occasionally attract trolls, but with a judicious combination of ignoring, muting and blocking, those can be handled.

However, in my effort to avoid outrage (I like to consume news but don’t care about random people’s comments on it), I’ve significantly pruned my following list. Very few “friends”. A few “twitter celebrities”. Topic-specific studs. The problem there is that you can leave comments, but when you see that nobody is replying to them, you lose interest!

So it’s Jai all over the place.

No comments.