Open and closed platforms

This is a blogpost that I had planned a very long time (4-5 weeks) ago, and I’m only getting down to writing it now. So my apologies if the quality is not as good as that of my usual blogposts.

Many of you would have looked at the title of this blogpost and assumed that the trigger for this was the “acquisition” of Joe Rogan’s podcast by Spotify. For a large sum of money, Spotify is “taking his podcast private”, making it exclusive to Spotify subscribers.

However, this is only an “immediate trigger” for writing this post. I’d planned this post way back in April when I’d written one of my Covid-19 related blogposts – maybe it was this one.

I had joked the post needed to be on Medium for it to be taken seriously (a lot of covid related analysis was appearing on Medium around that time). Someone suggested I actually put it on Medium. I copied and pasted it there. Medium promptly took down my post.

I got pissed off and swore to never post on Medium again. I was reminded of the time last year when YouTube randomly pulled down one of my cricket videos after someone (an IP troll, I later learnt) wrongly claimed that I’d used copyrighted sounds in my video (the only sound in that video was my own voice). I lodged a complaint with YouTube, and my video was resurrected, but it was off air for a month (I think).

Medium and YouTube are both examples of closed platforms. All content posted on these platforms is “native to the platform”. These platforms provide a means of distributing (and sometimes even marketing) the content, and all content posted there essentially belongs to the platform. Yes, you get paid a cut of the ad fee (in case your YouTube channel becomes super powerful, for example), but YouTube decides whether your video deserves to be there at all, and whose homepages to put it on.

The main feature of a closed platform is that any content created on the platform needs to be consumed on the same platform. A video I’ve uploaded on YouTube is only accessible on YouTube. A Medium post can only be read on Medium. A tweet can only be read on Twitter. A Facebook post only on Facebook.

The advantage of closed platforms is that by submitting your content to one, you hope to leverage some benefits the platform might offer, such as additional marketing, distribution and discovery.

This blog doesn’t work that way. Blogposts work through this technology called “RSS”, and to read what I’m writing here you don’t necessarily need to visit noenthuda.com. You can read it on the feed reader of your choice (Feedly is what I use). Of course there is the danger that one feed reader can have overwhelming market share, and the destruction of that feed reader can kill the ecosystem itself (as happened with Google Reader in 2013). Yet, RSS being an open platform means that this blog still exists, and you can continue to receive it on the RSS reader of your choice. If Medium were to shut down tomorrow, all Medium posts might be lost.
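To make the “open” bit concrete, here is a minimal sketch in Python of what any feed reader does: it parses the feed’s XML and lists the posts, without ever opening the blog’s own pages. The feed text below is a hand-written placeholder (the titles and example.com links are assumptions, not this blog’s actual feed), and a real reader would first fetch the XML over HTTP.

```python
import xml.etree.ElementTree as ET

# A tiny hand-written RSS 2.0 document standing in for a real feed.
# Titles and URLs are placeholders, not an actual blog feed.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>https://example.com/</link>
    <description>A placeholder feed</description>
    <item>
      <title>Open and closed platforms</title>
      <link>https://example.com/open-and-closed</link>
    </item>
    <item>
      <title>Blogs and tweetstorms</title>
      <link>https://example.com/blogs-and-tweetstorms</link>
    </item>
  </channel>
</rss>"""


def read_feed(xml_text):
    """Return (title, link) pairs for every <item> in an RSS document."""
    root = ET.fromstring(xml_text)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


for title, link in read_feed(SAMPLE_FEED):
    print(title, "->", link)
```

Because the format is public, any program that can parse this XML can act as a reader, which is why no single reader owns the ecosystem.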

Another example of an open platform is email – it doesn’t matter what email service or app you use, my email and yours are interoperable. India’s Unified Payments Interface (UPI) is another open platform – the sender and receiver can use apps of their choice and still transact.

And yet another open platform (which a lot of people didn’t really realise is an open platform) is podcasting. Podcasts run on the RSS protocol. So when you subscribe to a podcast using Apple Podcasts, it is similar to adding a blog to your Feedly. This thread by Ben Thompson of Stratechery (that I just stumbled upon when I started writing this post) sums it up well:

What Spotify is trying to do (with the Joe Rogan and Ringer deals) is to take this content off open platforms and put it on its own closed platform. Some people (like Rogan) will take the bait since they’re getting paid for it. However, this comes at the cost of control – like I’m not sure if we’ll have another episode of Rogan’s podcast where host and guest light up a joint.

Following my experiences with Medium and YouTube, when my content was yanked off for no reason (or for flimsy reasons), I’m not sure I like closed platforms any more. Rather, someone needs to pay me a lot of money to take my content to a closed platform (speaking of which, do you know that all my writing for Mint (written in 2013-18) is behind their newly erected paywall now?).

In closing I must mention that platforms being “open” and platforms being “free” are orthogonal. A paid podcast or newsletter is still on an open platform (see Ben Thompson tweetstorm above), since it can be consumed on a medium independent of the one where it was produced – essentially a different feed is generated depending on what the customer has paid for.
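Here is a rough sketch in Python of how “a different feed depending on what the customer has paid for” might work. Everything in it (the episode list, the `paid_subscriber` flag, the example.com URLs) is a hypothetical stand-in; real paid feeds usually also hand each subscriber a private feed URL, which is left out here.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalogue: (title, audio URL, paid-only flag).
EPISODES = [
    ("Episode 1", "https://example.com/ep1.mp3", False),
    ("Episode 2 (subscribers)", "https://example.com/ep2.mp3", True),
]


def build_feed(episodes, paid_subscriber):
    """Build an RSS document, leaving out paid-only items for free listeners."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = "Example Podcast"
    for title, url, paid_only in episodes:
        if paid_only and not paid_subscriber:
            continue  # free listeners never see this item
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        # length="0" is a placeholder; a real feed gives the file size in bytes.
        ET.SubElement(item, "enclosure", url=url, length="0", type="audio/mpeg")
    return ET.tostring(rss, encoding="unicode")


print(build_feed(EPISODES, paid_subscriber=False))
print(build_feed(EPISODES, paid_subscriber=True))
```

Either way the output is plain RSS, so the listener can still use whichever podcast app they like.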

Now that I’ve written this post, I don’t know what the point of this is. Maybe it’s just for collecting and crystallising my own thoughts, which is the point behind most of my blogposts anyway.

PS: We have RSS feeds for text and podcasts for audio. I wonder why we don’t have a popular and open protocol for video.

Blogs and tweetstorms

The “tweetstorm” is a relatively new art form. It basically consists of a “thread” of tweets that serially connect to one another, which all put together are supposed to communicate one grand idea.

It is an art form that grew organically on twitter, almost as a protest against the medium’s 140 (now raised to 280) character limit. Nobody really knows who “invented” it. It had emerged by 2014, at least, as this Buzzfeed article cautions.

In the early days, you would tweetstorm by continuously replying to your own tweet, so the entire set of tweets could be seen by readers as a “thread”. Then in 2017, Twitter itself recognised that it was being taken over by tweetstorms, and added “native functionality” to create them.

In any case, as someone from “an older generation” (I’m from the blogging generation, if I can describe myself so), I was always fascinated by this new art form that I’d never really managed to master. Once in a while, rather than writing here (which is my natural thing to do), I would try and write a tweetstorm. Most times I didn’t succeed. Clearly, someone who is good at an older art form struggles to adapt to newer ones.

And then something clicked on Wednesday when I wrote my now famous tweetstorm on Bayes’s Theorem and covid-19 testing. I got nearly two thousand new followers, I got invited to a “debate” on The Republic news channel and my tweetstorm is being circulated in apartment Telegram groups (though so far nobody has sent me my own tweetstorm).

In any case, I don’t like platforms where I’m not in charge of content (that’s a story for another day), and so thought I should document my thoughts here on my blog. And I did so last night. At over 1200 words, it’s twice as long as my average blogpost (it tired me so much that the initial version, which went on my RSS feed, had a massive typo in the last line!).

And while I was writing that, I realised that the tone in the blog post was very different from what I sounded like in my famous tweetstorm. In my post (at least by my own admission, though a couple of friends have agreed with me), I sound reasonable and measured. I pleasantly build up the argument and explain what I wanted to explain with a few links and some data. I’m careful about not taking political sides, and everything. It’s what good writing should be like.

Now go read my tweetstorm:

Notice that right from the beginning I’m snide. I’m bossy. I come across as combative. And I inadvertently take sides here and there. Overall, it’s bad writing. Writing that I’m not particularly proud of, though it gave me some “rewards”.

I think that’s inherent to the art form. While you can use as many tweets as you like, you have a 280 character limit in each. Which means that each time you’re trying to build up an argument, you find yourself running out of characters, and you attempt to “finish your argument quickly”. That means that each individual tweet can come across as too curt or “to the point”. And when you take a whole collection of curt statements, it’s easy to come across as rude.
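The constraint is easy to see in code. Below is a minimal Python sketch (my own illustration, not how Twitter’s native threading works) that greedily packs whole words into chunks of at most 280 characters. Notice that the split points depend purely on character count, not on where a thought ends.

```python
def split_into_tweets(text, limit=280):
    """Greedily pack whole words into chunks of at most `limit` characters."""
    tweets, current = [], ""
    for word in text.split():
        # Hard-split any single word that is longer than the limit itself.
        while len(word) > limit:
            if current:
                tweets.append(current)
                current = ""
            tweets.append(word[:limit])
            word = word[limit:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= limit:
            current = candidate
        else:
            # The next word doesn't fit: close this tweet and start another.
            tweets.append(current)
            current = word
    if current:
        tweets.append(current)
    return tweets


argument = "word " * 200  # 200 four-letter words
thread = split_into_tweets(argument)
print(len(thread), [len(t) for t in thread])
```

If you want each tweet to hold a complete thought, you have to do the squeezing yourself, and that squeezing is where the curtness comes from.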

That is possibly true of most tweetstorms. However good your intention is when you sit down to write them, the form means that you will end up coming across as rude and highly opinionated. Nowadays, people seem to love that (maybe they’ve loved it all the time, and now there is an art form that provides this in plenty), and so tweetstorms can get “picked up” and amplified and you become popular. However, try reading them when you’re yourself in a pleasant and measured state, and you find that most tweetstorms are unreadable, and constitute bad writing.

Maybe I’m writing this blogpost because I’m loyal to my “native art form”. Maybe my experience with this art form means that I write better blogs than tweetstorms. Or maybe it’s simply all in my head. Or that blogs are “safe spaces” nowadays – it takes effort for people to leave comments on blogs (compared to replying to a tweet with abuse).

I’ll leave you with this superb old article from The Verge on “how to tweetstorm”.

Yet another social media sabbatical

Those of you who know me well know that I keep taking these social media sabbaticals. Once in a while I decide that I’m spending too much time on these platforms, wasting both time and mental energy, and log off. Time has come for yet another such break.

I had a bumper day on twitter yesterday. I wrote this one tweet storm that went viral. Some 2000 plus retweets and all that. Basically I used some 15 tweets to explain Bayes’s Theorem, a concept that most people find really hard to understand.
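For those who haven’t seen the theorem applied to testing, here is a small worked sketch in Python. The numbers are hypothetical, picked only to illustrate the mechanics, and are not the figures from the tweetstorm: 1% of people infected, a test that catches 90% of infections, and a 5% false positive rate.

```python
def posterior(prior, sensitivity, specificity):
    """P(infected | positive test), computed with Bayes's Theorem."""
    true_pos = sensitivity * prior                 # infected and tests positive
    false_pos = (1 - specificity) * (1 - prior)    # healthy but tests positive
    return true_pos / (true_pos + false_pos)


# Hypothetical inputs, for illustration only.
p = posterior(prior=0.01, sensitivity=0.90, specificity=0.95)
print(round(p, 3))  # 0.154
```

Under these assumed numbers, only about 15% of those who test positive are actually infected, because the few true positives are swamped by false positives from the much larger healthy population.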

For the last 24 hours, my twitter mentions have been a mess. I’ve tried various things – applying filters, switching from the native app to tweetdeck, etc. but I find that I keep checking my mentions for that dopamine rush that comes out of new followers (I have some 1500 new followers after the tweetstorm, including Chris Arnade of Dignity fame), new retweets and new likes.

And the dopamine rush is frequently killed by hate, as a tweetstorm like this will inevitably generate. I did another tweetstorm this morning detailing this hate – it has to do with the “two Overton Windows” post I’d written a couple of weeks ago.

People are so deranged that even a maths tweetstorm (like the one at the beginning of this post) can be made political, and you see people go on and on.

In fact, there is this other piece I had written (for Mint, back in 2015) that again uses Bayes’s Theorem to explain online flamewars. Five years down, everything I wrote is true.

It is futile to engage with most people on Twitter, especially when they take their political selves too seriously. It can be exhausting, and 27 hours after I wrote that tweetstorm I’m completely exhausted.

So yeah this is not a social media sabbatical like my previous ones where I logged off all media. As things stand I’m only off Twitter (I’ve taken mitigating steps on other platforms to protect my blood pressure and serotonin).

Then again, those of you who know me well know that when I’m off twitter I’ll be writing more here. You can continue to expect that. I hope to be more productive here, and in my work (I’m swamped with work this lockdown) as well.

I continue to be available on WhatsApp, and Telegram, and email. Those of you who have my email or number can reach me in one of those places. For everything else, there’s the “contact” tab on this blog.

See you more regularly here in the coming days!

Should I tweet at all?

This is not a rhetorical question.

I was doing some random data analysis today. I downloaded an archive of all my tweets, and of all my blog posts, and was looking at some simple statistics. I won’t bore you with a lot of the mundane details.

One thing that I must mention is that the hypothesis that twitter activity has an adverse impact on my blogging is disproved. I was looking at the number of words I’ve put on twitter each week and the number of words I’ve blogged in the same week. The two are uncorrelated.
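A minimal sketch in Python of this kind of check follows. The weekly word counts are made up purely for illustration (they are not my actual numbers), and Pearson’s correlation is one assumed choice of measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up weekly word counts, one entry per week.
tweet_words = [1200, 3400, 800, 2500, 1900, 3000]
blog_words = [1500, 900, 1600, 1000, 1400, 1100]

print(pearson(tweet_words, blog_words))
```

A value near zero would mean the two series are uncorrelated, while a clearly negative value would have supported the hypothesis that tweeting crowds out blogging.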

In any case, so far I’ve tweeted 60,716 tweets over the course of eleven and a half years. My tweets include a total of 992,453 words. Ignoring other handles, links and punctuation, maybe we can round this down to 950,000. In other words, in nine and a half years I’ve tweeted nine and a half hundred thousand words.

Or that on twitter alone I write a hundred thousand words a year. 

The content of my book was about 52,000 words (IIRC). In other words, I write enough content for two books EACH YEAR on twitter. In 2013, I tweeted nearly four books worth of content.

That, however, is not the only reason why I wonder if I should tweet at all. While I’ve discovered a lot of interesting people, and made interesting connections, and can “semi-keep-in-touch” with people through twitter, I’m not really sure about the “impact” of my tweets.

I thought I’d look at the tweets that have been most retweeted.

full_text | Date | retweet_count | favorite_count | Link
Why does the government / ruling party put out tweets with basic arithmetic errors? ₹14.98+₹9.02 is ₹24 not ₹27.44 https://t.co/oFaBNDYgpc | 2017-09-19 | 350 | 416 | https://twitter.com/karthiks/status/910122027306164229
Remember that Richter scale is logarithmic. Base 10 if I’m not wrong. So 7.7 is 10 times as bad as 6.7 | 2015-04-25 | 171 | 40 | https://twitter.com/karthiks/status/591858524147175425
Our @uber driver tonight was one Mr Akmal. He dropped us successfully. | 2017-12-24 | 148 | 312 | https://twitter.com/karthiks/status/944964502071623681
The greatest Hindi movie about Rajputs is Jaane Tu Ya Jaane Na | 2017-11-24 | 142 | 299 | https://twitter.com/karthiks/status/934173502281801730
Based on interim data, in 17 states NOTA has got more votes than AAP. #MintElections #MeaninglessComparisons http://t.co/LxZvtNme1P | 2014-05-16 | 134 | 19 | https://twitter.com/karthiks/status/467242247965007872
A whopping 332 out of 542 constituencies in the just-concluded General Elections saw a two-way contest. Another 184 saw three-way contests. In contrast, in the 2014 elections only 169 two-way contests, 278 three-cornered contests and 90 four-cornered contests | 2019-05-24 | 95 | 174 | https://twitter.com/karthiks/status/1131912762357981184
“these dark days” is a euphemism for “people I didn’t vote for have formed the government” | 2019-12-19 | 69 | 242 | https://twitter.com/karthiks/status/1207659817579335681
I have built this app that recommends single malt whiskies based on what you already like. https://t.co/B4PqxjUQI2 Details here: https://t.co/kc3yG1mx2o | 2018-11-02 | 56 | 202 | https://twitter.com/karthiks/status/1058347106438705153
Amazing number of commies on this list RT @suar4sure: “@BookLuster: Which dictator killed the most People? http://t.co/WlJDLAiMAn” | 2014-07-23 | 44 | 13 | https://twitter.com/karthiks/status/491876757176188928
If BJP hadn’t split, numbers would have been: Cong: 91, BJP: 86, JDS: 35 @gkjohn | 2013-05-08 | 39 | 3 | https://twitter.com/karthiks/status/332065593660407810
Today @moneycontrolcom / @CNNnews18 have unleashed a monstrosity of a map. The map explains nothing, and nothing can explain the map! https://t.co/VOooy26Ra2 | 2019-02-21 | 36 | 80 | https://twitter.com/karthiks/status/1098577156198805504
Stud thread https://t.co/gvuZIjV71I | 2018-07-22 | 34 | 105 | https://twitter.com/karthiks/status/1021130139290226688
there’s one piece of @ShashiTharoor ‘s writing that I’ve read multiple times – his blurb for my book. When I first read it, I was amazed at how precisely it communicated the idea of my book – much better than I’d ever managed to do. https://t.co/Lz2I9ZwW0L | 2017-12-14 | 33 | 138 | https://twitter.com/karthiks/status/941357298840059904
Did you know the cube root law of assembly size? It’s a heuristic that states that the optimal size of a national assembly is the cube root of the population | 2019-12-13 | 27 | 80 | https://twitter.com/karthiks/status/1205408450810798080
Great piece by Dheeraj Sanghi on the expulsion of students from IIT Roorkee: http://t.co/uxPduX680z | 2015-07-12 | 27 | 11 | https://twitter.com/karthiks/status/620138692762415104
did the Chinese workers use One Belt to beat up the police, and then escape on One Road? https://t.co/7lldCWrpxq | 2018-04-05 | 26 | 98 | https://twitter.com/karthiks/status/981886501456924672
I don’t know why people don’t get that a non-zero number that ends in zero is even. This is absolutely bizarre https://t.co/fpZQB24l0a | 2015-12-07 | 26 | 13 | https://twitter.com/karthiks/status/673831614581899264
the one thing AAP has succeeded in doing is to tremendously increase my respect for LokSatta and @JP_LOKSATTA http://t.co/8hjA5l8IKN | 2014-05-14 | 25 | 16 | https://twitter.com/karthiks/status/466432370296762368
Coffee truck at avenue road. By coffee board voluntarily retired employees association. Brilliant coffee. Ten bucks. http://t.co/rBOgRh1F2l | 2013-07-11 | 24 | 6 | https://twitter.com/karthiks/status/355201588505223170
And I present you the way the parliamentary constituencies in Bangalore are demarcated https://t.co/1tcGCimdG9 https://t.co/8QPIOzajN4 | 2019-03-26 | 24 | 59 | https://twitter.com/karthiks/status/1110574864803323905

Till date, I’ve had FIVE tweets with more than a hundred retweets. I’ve had ELEVEN tweets with more than a hundred likes (including one where I’ve simply said “stud thread” and then linked to a thread written by someone else).

In other words, while I might have four thousand odd followers, the amplification of my tweets is rather minimal.

So maybe I should not tweet at all? And instead devote the time and effort spent in tweeting to other means? Maybe write another book instead?

What do you think?

Content Flooding

I just came across this nice article on content flooding, which is about how the same sort of content “floods” us from all possible sources. All newspapers and news websites (not to mention news TV), at any point in time, are “flooding” us with articles about the same thing. This quote from the article possibly makes the point:

Ravi Somaiya wrote in the Columbia Journalism Review last year. In the cacophony of content and conversation around that content, the most familiar voices at the largest, fastest, trendiest outlets carry the farthest; according to SimilarWeb, which tracks website statistics, only five sites dominate around 50 percent of the share of newspaper traffic in the U.S. CJR also reports that newspapers online, now with a borderless audience, publish more than twice as many stories as they used to, often with a much smaller staff. So what you get are dailies that operate like news channels, dissecting stories (sometimes even original ones) for ratings, which basically means they cover more of less. “Faced with a sea of headlines, in every permutation,” wrote Somaiya, “even the most determined mind rebels and begins to dismiss it all as noise.”

(ed: emphasis added)

Now I don’t intend to reinforce flooding, but I’ve written about this topic before, about how Twitter is like Times Now. In the last month or so I’ve been mostly off social media, and one primary reason why I went off was to avoid flooding. Rather than getting lots of news about the same topic from all sorts of sources, I want to learn about a variety of things.

And I’ve tried to tune my media consumption to try and avoid this kind of flooding. I’m off social media now, for there everyone talks about the same topic of the day most of the time. I haven’t watched news TV for some 10-15 years now. I get my article and blog content from RSS feeds (if you have any RSS feeds you think I might like, do share!), and from a bunch of newsletters I’m subscribed to (the article shared at the beginning of this post came from a newsletter by the Guardian).

And by myself being not part of the flooding monoculture (on twitter), I end up writing about stuff that other people may not be writing about at that point in time. And that’s my little contribution to reduce flooding in the world!

Links

While discussing podcasts, a friend remarked last week that one of the best things about podcasts is the discovery of new hitherto unknown people.

In response I said that this was the function that blogs used to perform a decade ago. Back in the day, blogs were full of links to other blogs. Every blog hosted a column of “favourite” blogs. You could look up people’s livejournal friends pages. People left comments on each other’s blogs, along with links to their blogs.

So as you consumed interesting blog posts, you would naturally get linked to other interesting blogs, and discover new people (incidentally this was how my wife and I discovered each other, but that’s a story for another day).

Where blogs scored over today’s podcasts, however, was that as they directed you to hitherto unknown people, they also pointed you to the precise place where you could consume more of their stuff – in the form of a blog link. So if you linked to this blog, a reader who landed up here could then discover more of me – well beyond whatever of me you featured on your blog along with your link.

And this is a missing link in the podcast – while podcast episodes have links to the guest’s work, it is not an easy organic process to go through to this link and start consuming the guest’s work (except I guess in terms of twitter accounts). Moreover, the podcast is an audio medium, so it’s not natural to go to the podcast page and click through to the links.

This is one of the tragedies of the decline of blogging (clearly I’m one of the holdouts of the blogging era, maybe because it’s served me so well). Organic discovery of new people and content is not as great as it used to be. Well, Twitter and retweets exist, but the short nature of the format is that it’s much harder to judge if someone is worth following there.

Telling stories with data

I’m about 20% through with The Verdict by Prannoy Roy and Dorab Sopariwala. It’s a fascinating book, except for one annoyance – it is full of tables that serve no purpose but to break the flow of text.

I must mention that I’m reading the book on the Kindle, which means that the tables can pose a major annoyance. Text breaks off midway through one page, and the next couple of pages involve a table or two, with several lines of text explaining what’s in the table. And then the text continues. It makes for a rather disruptive reading experience. And some of the tables have just one data point – making one wonder why it has been inserted there at all.

This is not the first book that I’ve noticed that makes this mistake. Some of the sports analytics books I’ve read in recent times, such as The Numbers Game, also make the same error (I read that in print, and still had the same disruption). Bhagwati and Panagariya’s Why Growth Matters is similarly unreadable. Tables are abruptly inserted into the middle of the text, leading to the reader losing flow.

Telling a data story in book length is a completely different challenge to telling one in article length. And telling a story with data is a complete art form. When you’re putting a table there, you need to be able to explain why that table is important to the story – rather than putting it there just because it seems more rigorous.

Also the exact placement of the table (something that can’t be controlled well in Kindle, but is easy to fix in either HTML or print) matters – the table should be relevant to the piece of text immediately preceding and succeeding it, in a way that it doesn’t disrupt the reader’s flow. More importantly, the table should be able to add value at that particular point – perhaps building on something that has been described in the previous paragraph.

Book length makes it harder because people don’t normally expect tables and figures to disturb their reading flow when reading something of book length. Also, the book format means that it is not always possible to insert a table at a precise point (even in print, where pagination is an issue).

So how do you tell a book length story with data? Firstly, be very stingy about the data that you want to show – anything that doesn’t immediately add value should be banished to the appendix. Even the rigour, which academics might be particular about, can be pushed to the end notes (not footnotes, since those can be disruptive to flow as well, turning pages into half pages).

Then, once you know that showing a particular table or graph is essential to telling the story, put it either at the beginning or at the end of a chapter. This way, it doesn’t break the reader’s flow. Then, refer to individual numbers in the middle of the text without having to put the entire table in there. Unless each and every data point in the table is important, banish it to the endnotes.

One other common mistake (I did it in my piece in Forbes published yesterday) is to put a big table and not talk about it. It only serves to confuse the reader, who starts looking for explanations for everything in the table in later parts.

I guess authors and analysts tend to get possessive. If you have worked hard to produce insights from data, you seek to share as much of it as possible. And this can mean simply dumping all the data in the piece without regard for what the reader will do with it.

I’m making a note to myself to not repeat this mistake in future.

Vlogging!

The first seed was sown in my head by Harish “the Psycho” J, who told me a few months back that nobody reads blogs any more, and I should start making “analytics videos” to increase my reach and hopefully hit a new kind of audience with my work.

While the idea was great, I wasn’t sure for a long time what videos I could make. After all, I’m not the most technical guy around, and I had no patience for making videos on “how to use regression” and stuff like that. I needed a topic that would be both potentially catchy and something where I could add value. So the idea remained an idea.

For the last four or five years, my most common lunchtime activity has been to watch chess videos. I subscribe to the YouTube channels of Daniel King and Agadmator, and most days when I eat lunch alone at home are spent watching their analyses of games. Usually this routine gets disrupted on Fridays when the wife works from home (she positively hates these videos), but one Friday a couple of months back I decided to ignore her anyway and watch the videos (she was in her room working).

She had come out to help herself to another serving of whatever she had made that day and saw me watching the videos. And she suddenly asked me why I couldn’t make such videos as well. She has seen me work over the last seven years to build what I think is a fairly cool cricket visualisation, and said that I should use it to make little videos analysing cricket matches.

And since then my constant “background process” has been to prepare for these videos. Earlier, Stephen Rushe of Cricsheet used to unfailingly upload ball by ball data of all cricket matches as soon as they were done. However, two years back he went into “maintenance mode” and has stopped updating the data. And so I needed a method to get data as well.

Here, I must acknowledge the contributions of Joe Harris of White Ball Analytics, who not only showed me the APIs to get ball by ball data of cricket matches, but also gave very helpful inputs on how to make the visualisation more intuitive, and palatable to the normal cricket fan who hasn’t seen such a thing before. Joe has his own win probability model based on ball by ball data, which I think is possibly superior to mine in a lot of scenarios (my model does badly in high-scoring run chases), though I’ve continued to use my own model.

So finally the data is ready, and I have a much improved visualisation compared to what I had during the IPL last year, and I’ve created what I think is a nice app using the Shiny package that you can check out for yourself here. This covers all T20 international games, and you can use the app to see the “story of each game”.

And this is where the vlogging comes in – in order to explain how the model works and how to use it, I’ve created a short video. You can watch it here:

While I still have a long way to go in terms of my delivery, you can see that the video has come out rather well. There are no sync issues, and you see my face also in one corner. This was possible due to my school friend Sunil Kowlgi’s Outklip app. It’s a pretty easy to use Chrome app, and the videos are immediately available on the platform. There is quick YouTube integration as well, for you to upload them.

And this is not a one time effort – going forward I’ll be making videos of limited overs games analysing them using my app, and posting them on my YouTube channel (or maybe I’ll make a new channel for these videos. I’ll keep you updated). I hope to become a regular vlogger!

So in the meantime, watch the above video. And give my app a spin. Soon I’ll be releasing versions covering One Day Internationals and franchise T20s as well.

Pertinent Observations Grows Up

Over the weekend, I read Ben Blatt’s Nabokov’s Favourite Word Is Mauve, a simple natural language processing based analysis of hundreds of popular authors and their books. In this, Blatt uses several measures of goodness or badness of writing, and then measures different authors by them.

So he finds, for example, that Danielle Steel opens a lot of her books by talking about the weather, or that Charles Dickens uses a lot of “anaphora” (anyone who remembers the opening of A Tale of Two Cities shouldn’t be surprised by that). He also talks about the use of simple word counts to detect authorship of unknown documents (a separate post will come on that soon).

As someone who has already written a book (albeit nonfiction), I found a lot of this rather interesting, and constantly found myself trying to evaluate myself on the metrics to which Blatt subjected the famous authors. And one metric that I found especially interesting was the “Flesch-Kincaid grade level”, which is a measure of complexity of language in a work.

It is a fairly simple formula, based on a linear combination of the average number of words per sentence and the average number of syllables per word. The formula goes like this:

Flesch-Kincaid Grade Score = 0.39 × (total words / total sentences) + 11.8 × (total syllables / total words) − 15.59

And the result of the formula tells the approximate school grade of a reader who will be able to understand your writing. As you see, it is not a complex formula, and the shorter your sentences and shorter your words (measured in syllables), the simpler your prose is supposed to be.
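As a quick worked example, here is the standard Flesch-Kincaid grade formula as a Python function (a sketch for illustration; my own analysis was done in R). The two sets of counts are invented to show how the score moves.

```python
def fk_grade(words, sentences, syllables):
    """Flesch-Kincaid grade level from raw counts (standard coefficients)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


# Invented counts: 100 words in 10 sentences, 130 syllables in all.
print(round(fk_grade(words=100, sentences=10, syllables=130), 2))  # 3.65

# Very short sentences of mostly one-syllable words can push the score below zero.
print(round(fk_grade(words=50, sentences=10, syllables=55), 2))  # -0.66
```

The first case works out as 0.39 × 10 + 11.8 × 1.3 − 15.59 = 3.65, roughly a third or fourth standard level.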

The simplest works by this metric as mentioned in Blatt’s book are the works of Dr. Seuss, such as The Cat in the Hat or Green Eggs and Ham, on account of the exclusive usage of a small set of words in both books (Dr Seuss wrote the latter as a challenge, not unlike the challenges we would pose each other during “class participation” in business school). These books have a negative grade score, technically indicating that even a nursery kid should be able to read them, but actually meaning they’re simply easy to read.

Since the Flesch Kincaid Grade Score is based on a simple set of parameters (word count, sentence count and syllable count), it was rather simple for me to implement that on the posts from this blog.

I downloaded an XML export of all posts (I took this dump some two or three weeks ago), and then used R, with the tidytext package, to analyse the posts. Word count was the most straightforward, using the str_count function in the stringr package (part of the tidyverse). Sentence count was a bit more complicated – there were no ready algorithms. I instead just searched for “sentence enders” (., ?, !, etc. I know the use of . in abbreviations creates problems, but I can live with that).

Syllable count was the hardest. Again, there are some packages, but they’re incredibly hard to use. Finally, after much searching, I came across some code that approximates this, and used it.
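The same “sentence enders” idea can be sketched in Python with regular expressions. This is an illustrative equivalent rather than the R code I actually ran, and the function names are made up for the example.

```python
import re


def count_words(text):
    """Count whitespace-separated tokens."""
    return len(re.findall(r"\S+", text))


def count_sentences(text):
    """Count runs of '.', '?' or '!' as sentence enders (at least 1)."""
    return max(1, len(re.findall(r"[.!?]+", text)))


sample = "Dr. Seuss wrote this. Did he? Yes!"
print(count_words(sample))      # 7
print(count_sentences(sample))  # 4
```

The sample has three sentences but the count comes out as four, because the full stop in “Dr.” is treated as a sentence ender. That is the abbreviation problem in action.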
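To give a flavour of what such an approximation looks like, here is one common heuristic sketched in Python: count groups of consecutive vowels, and discount a silent trailing “e”. This is an assumed stand-in for illustration, not the code I found and used, and it gets some words wrong.

```python
import re


def count_syllables(word):
    """Approximate syllables as vowel groups, discounting a silent final 'e'."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    count = len(re.findall(r"[aeiouy]+", word))
    # "make" has two vowel groups but one syllable; "table" keeps its "-le".
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(1, count)


for w in ["cat", "hello", "table", "make", "syllable", "rhythm"]:
    print(w, count_syllables(w))
```

It handles the first five words correctly, but calls “rhythm” one syllable instead of two, which is the price of approximating.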

Now that the technical stuff is done with, let’s get to the content. This word count, sentence count and syllable count all flow into calculating the Flesch-Kincaid (FK) score, which is the approximate class that one needs to be in to understand the text. Let’s just plot the FK score for all my blog posts (a total of 2341 of them) against time. I’ve added a regression line for good effect.
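For anyone wondering what the regression line involves, here is a bare-bones least squares fit in Python. The data points are made up for illustration (they are not my 2341 actual scores), and the plot itself would come from whatever charting tool you prefer.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up (year, FK score) points, purely illustrative.
years = [2004, 2006, 2008, 2010, 2012]
scores = [5, 8, 8, 10, 10]

slope, intercept = fit_line(years, scores)
print(f"FK score changes by about {slope:.2f} grade levels per year")
```

A positive slope means the writing is getting harder to read over time; a straight line, of course, cannot show plateaus, which is where a smoothed curve helps.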

The trend is pretty clear. Over time, this blog has become more complicated and harder to read. In fact, drawing this graph slightly differently gives another message. This time, instead of a regression line, I’ve drawn a curve showing the trend.

When I started writing in 2004, I was at a 5th standard level. This increased steadily for the first two years (I gained a lot of my steady readership in this time) to get to about 8th standard, and plateaued there for a bit. And then again around 2009-10 there was an increase, as my blog got up to the 10th standard level. It’s pretty much stayed there ever since, apart from a tiny bump up at the end of 2014.

I don’t know if this increase in “complexity” of my blog is a good or a bad thing. On the one hand, it shows growing up. On the other, it’s becoming tougher to read, which has probably coincided with a plateauing (or even a drop) in the readership as well.

Let me know what you think – if you prefer this “grown up style”, or if you want me to go back to the simpler writing I started off with.

Newsletter!

So after much deliberation and procrastination, I’ve finally started a newsletter. I call it “the art of data science” and the title should be self-explanatory. It’s pure unbridled opinion (the kind of which usually goes on this blog), except that I only write about one topic there.

I intend to have three sections and then a “chart of the edition” (note how cleverly I’ve named this section to avoid giving much away on the frequency of the newsletter!). This edition, though, I ended up putting too much harikathe, so I restricted myself to two sections before the chart.

I intend to talk a bit each edition about some philosophical part of dealing with data (this section got a miss this time), a bit on data analysis methods (I went a bit meta on this this time) and a little bit on programming languages (which I used for bitching a bit).

And that I plan to put a “chart of the edition” means I need to read newspapers a lot more, since you are much more likely to find gems (in either direction) there than elsewhere. For the first edition, I picked off a good graph I’d seen on Twitter, and it’s about Hull City!

Anyway, enough of this meta-harikathe. You can read the first edition of the newsletter here. In case you want to get it in your inbox each week/fortnight/whenever I decide to write it, then subscribe here!

And put feedback (by email, not comments here) on what you think of the newsletter!