Big Data and Fast Frugal Trees

In his excellent podcast episode with EconTalk’s Russ Roberts, psychologist Gerd Gigerenzer introduces the concept of “fast and frugal trees”. When someone needs to make decisions quickly, Gigerenzer says, they don’t take into account a large number of factors, but instead rely on a small set of thumb rules.

The podcast itself is based on Gigerenzer’s 2009 book Gut Feelings. Based on how awesome the podcast was, I read the book, but found that it didn’t offer too much more than what the podcast itself had to offer.

Coming back to fast and frugal trees..

In recent times, ever since “big data” became a “thing” in the early 2010s, it is popular for companies to tout the complexity of their decision algorithms, and machine learning systems. An easy way for companies to display this complexity is to talk about the number of variables they take into account while making a decision.

For example, you can have “fin-tech” lenders who claim to use “thousands of data points” on their prospective customers’ histories to determine whether to give out a loan. A similar number of data points is used to evaluate resumes and determine if a candidate should be called for an interview.

With cheap data storage and compute power, it has become rather fashionable to “use all the data available” and build complex machine learning models (which aren’t that complex to build) for decisions that were earlier made by humans. The problem is that this can sometimes result in overfitting (the system learning noise that it shouldn’t be learning), which can lead to disastrous predictive power.

In his podcast, Gigerenzer talks about fast and frugal trees, and says that humans in general don’t use too many data points to make their decisions. Instead, for each decision, they build a quick “fast and frugal tree” and make their decision based on their gut feelings about a small number of data points. What data points to use is determined primarily based on their experience (not cow-like experience), and can vary by person and situation.

The advantage of fast and frugal trees is that the model is simple, and so has little scope for overfitting. Moreover, as the name describes, the decision process is rather “fast”, and you don’t have to collect all possible data points before you make a decision. The problem with productionising the fast and frugal tree, however, is that each user’s decision-making process is different, so the question becomes how we can learn that process in order to make the best decisions at a personalised level.

How you can learn someone’s decision-making process (when you’ve assumed it’s a fast and frugal tree) is not trivial, but if you can figure it out, then you can build significantly superior recommender systems.

If you’re Netflix, for example, you might figure that someone makes their movie choices based only on age of movie and its IMDB score. So their screen is customised to show just these two parameters. Someone else might be making their decisions based on who the lead actors are, and they need to be shown that information along with the recommendations.
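To make the idea concrete, here is a minimal sketch of a fast and frugal tree for the first viewer above. It assumes the usual structure (one cue per node, checked in order, with an exit at every node); the `Cue` class, the `decide` function and the thresholds (a score of 7, an age of 20 years) are all made up for illustration, and this is not how Netflix or anyone else actually does it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Movie = Dict[str, float]


@dataclass
class Cue:
    """One node of the tree: a single yes/no question with an immediate exit."""
    question: Callable[[Movie], bool]  # the one thing this node looks at
    exit_answer: bool                  # the answer that ends the decision here
    decision: str                      # what is decided if the exit is taken


def decide(movie: Movie, cues: List[Cue], default: str) -> str:
    """Check cues in order and stop at the first exit; later cues are never looked at."""
    for cue in cues:
        if cue.question(movie) == cue.exit_answer:
            return cue.decision
    return default


# Hypothetical viewer who only cares about IMDB score and age of the movie.
viewer_tree = [
    Cue(lambda m: m["imdb_score"] >= 7.0, exit_answer=False, decision="skip"),
    Cue(lambda m: m["age_years"] <= 20, exit_answer=False, decision="skip"),
]

print(decide({"imdb_score": 8.1, "age_years": 5}, viewer_tree, default="watch"))
```

A viewer who decides based on lead actors would simply have a different list of cues, which is exactly the part that has to be learnt per person.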

Another book I read recently was Todd Rose’s The End of Average. The book makes the powerful point that nobody really is average, especially when you’re looking at a large number of dimensions, so designing for average means you’re designing for nobody.

I imagine that one reason why a lot of recommender systems (Netflix or Amazon or Tinder) fail is that they model for the average, building one massive machine learning system, rather than learning each person’s fast and frugal tree.

The latter isn’t easy, but if it can be done, it can result in a significantly superior user experience!

What Makes The Athletic Great

In recent times I’ve bought subscriptions to two online media outlets – The Ken and The Athletic. I’d subscribed to the Ken a year ago, and was happy enough with the hit rate of their pieces (I’d find one in two pieces insightful) that I extended my subscription for three years earlier this year.

And since I did that extension, the product has been disappointing. They lost half their team to The Morning Context, a breakaway (and similar) outlet. They decided to expand in South East Asia, and since I have little interest in articles about that region (at least not enough to pay for the writing), that automatically means less content that interests me. In some senses their quality is slipping. All this together means that I find less than one in five articles in The Ken compelling, and with the frequency of their publication (one article every weekday) I’m pretty disappointed.

Maybe it has to do with Marie Kondo’s popularity, or interest in behavioural economics research about the paradox of choice, but organisations are starting to make minimalism and limitations in inventory a virtue. The Ken started with the aim of “exactly one long form article every day”.

Having less choice, and being minimalistic, is good when this limited choice fits the appetite of the customer. However, if the choice isn’t particularly relevant, then minimalism becomes a bug rather than a feature – the customer doesn’t find what she is looking for and goes on to another outlet.

In that sense, I quite like the model of The Athletic, which I bought a year-long subscription to a year back. The Athletic’s model is just the opposite – massively high volumes with a highly curated personal feed. And maybe they’ve got their curation right, in terms of getting customers to click on the right kind of tags at the time of sign up, but so far I’ve found at least two useful articles on their site every single day since I turned up. And that’s insane value for money!

And that is despite me being interested in exactly one out of the nine sports that The Athletic covers (it’s mostly US-centric, and I don’t follow American sport at all. However I guess I’ll find it useful when I have to follow any controversy in American sport). And I’m interested in a subset of that – I follow one league (English Premier League) and games played by a handful of clubs in that league.

If I compare The Athletic to Netflix (both subscription-driven media outlets with large volumes of content), the former scores on discoverability.

Maybe it is easier to understand someone’s interests in sport than in movies or TV shows. Maybe it is that The Athletic, right up front, asked me to identify which sports, leagues, authors and teams I’m interested in (Netflix never made an attempt to do that). Maybe it is that The Athletic, with loads of fresh content every single day, is able to serve my preferences far more easily than Netflix.

In any case, reading the Athletic makes me think that if I were to run a media outlet some day, I would want to follow that kind of a model – produce lots of content, so that lots of people will be interested in buying subscriptions, and then hope to use superior algorithms to make sure that people can see what they want and not have to cut through too much noise in order to do so!

The Business Standard is innumerate

I guess there is not that much information in the headline here – claiming that a bunch of journalists and editors are innumerate is like saying that the sky is blue. You would be hard-pressed to find journalists and editors who can actually parse numbers, though I must mention that I’ve been lucky enough to work with a few editors who actually understand arithmetic!

So what happened today? Basically in today’s front page, BS journalists (one Vinay Umarji in particular) and editors have displayed an utter lack of understanding of how relative grading and percentiles work. The context is CAT results, which came out yesterday.

(I’ve put a scan since the online version is behind a paywall).

There is information in saying that “number of candidates scoring 100 percentile is lowest in six years”, and the information I take out of that is that the number of test takers this year is the lowest in six years.

And for four of those six years, the numbers were inflated, since double the number of people who were supposed to get 100 percentile actually got 100 percentile. Since CAT percentiles are given to two decimal places, you get 100 percentile if you are in the top 0.005% of all candidates who took the exam. Or – if your “percentile” is higher than 99.995, it gets rounded up to 100.

For three years in the middle, the CAT administrators (usually they’re Quantitative Methods professors at IIMs), for whatever reason, rounded up everyone who got a percentile higher than 99.990 to 100. I’d written about that in my article for Mint three years back.

Coming back, CAT is an exam that follows relative grading. All that “100 percentile” means is that someone is within the top 0.005% of all candidates who wrote the exam. So if more candidates write the exam, more people will get “100 percentile”. In my time, for example (CAT 2003-4) some 1.3 lakh people had written the exam, so 7 of us got “100 percentile”. Nowadays the number of test takers has gone up, so more people get that score.
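The arithmetic is simple enough to sketch. This assumes percentiles are spread evenly across candidates and rounded to two decimal places, so the only inputs are the number of test takers and the rounding cutoff; the function name is mine, not anything the CAT administrators use.

```python
from decimal import Decimal


def expected_100_percentilers(test_takers: int, cutoff: str = "99.995") -> Decimal:
    """Expected number of candidates whose percentile rounds up to 100.

    Everyone above `cutoff` gets rounded to 100, so the count is just
    the top (100 - cutoff)% of all candidates who took the exam.
    """
    top_fraction = (Decimal(100) - Decimal(cutoff)) / Decimal(100)
    return test_takers * top_fraction


# Roughly 1.3 lakh test takers with the usual 99.995 cutoff.
print(expected_100_percentilers(130_000))            # 6.5, consistent with "7 of us"
# The same exam with everyone above 99.990 rounded up to 100.
print(expected_100_percentilers(130_000, "99.990"))  # twice as many
```

The count depends only on how many people wrote the exam and where the rounding cutoff sits, which is why it says nothing about how “good” a batch is.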

And then I found the rest of the article funny in a way as well, trying to do some sort of sociological analysis of the backgrounds of the people who had scored highly in the exam.

PS: The graph doesn’t give out much information (and I don’t know why the 2019 data point is missing there), but I guess it’s been put in there to make the journalists and editors seem more numerate than they are.


Experience and Cows

A lot of people make a big deal about experience. If some people (and some companies) are to be believed, the number of years in a job should be the only criterion of what someone needs to be paid and whether they deserve to be promoted.

However, not all experience is created equal. Experience matters when you are learning on the job, and where you learn the patterns that are inherent in your job, and you can over time replace your “slow thinking” about the job with more “fast thinking”.

If you continue to do the same thing in the same way throughout the years of experience, not bothering to figure out why things are done certain ways, and how things can be done better, the experience isn’t of that much use.

I leave it to former Tottenham Hotspur manager Mauricio Pochettino to explain this concept with a beautiful and profound analogy (there’s a video in this link which I’m somehow unable to embed here).

It is like a cow that, every day in 10 years, sees the train cross in front at the same time.

If you ask the cow, ‘what time is the train going to come’, it is not going to know the right answer.

In football, it is the same. Experience, yes, but hunger, motivation, circumstance, everything is so important.

It is unfortunate that the journalist who covered this story for Sky Sports thought this analogy was bizarre. Maybe he has been doing his job reporting on press conferences in the same way a cow sees a train passing by at a particular time every day?

Liverpool FC: Mid Season Review

After 20 games played, Liverpool are sitting pretty on top of the Premier League with 58 points (out of a possible 60). The only jitter in the campaign so far came in a draw away at Manchester United.

I made what I think is a cool graph to put this performance in perspective. I looked at Liverpool’s points tally at the end of the first 19 match days through the length of the Premier League, and looked at “progress” (the data for last night’s win against Sheffield isn’t yet up on my dataset, which also doesn’t include data for the 1992-93 season, so those are left out).

Given the strength of this season’s performance, I don’t think there’s that much information in the graph, but here it goes in any case:

I’ve coloured all the seasons where Liverpool were the title contenders. A few things stand out:

  1. This season, while great, isn’t that much better than the last one. Last season, Liverpool had three draws in the first half of the league (Man City at home, Chelsea away and Arsenal away). It was the first month of the second half where the campaign faltered (starting with the loss to Man City).
  2. This possibly went under the radar, but Liverpool had a fantastic start to the 2016-17 season as well, with 43 points at the halfway stage. To put that in perspective, this was one more than the points total at that stage in the title-chasing 2008-9 season.
  3. Liverpool went close in 2013-14, but in terms of points, the halfway performance wasn’t anything to write home about. That was also back in the time when teams didn’t dominate like nowadays, and eighty odd points was enough to win the league.

This is what Liverpool’s full season looked like (note that I’ve used a different kind of graph here. Not sure which one is better).


Finally, what’s the relationship between points at the end of the first half of the season (19 games) and the full season? Let’s run a regression across all teams, across all 38 game EPL seasons.

The fit doesn’t turn out to be THAT strong, with an R Squared of 41%. In other words, a team’s points tally at the halfway point in the season explains only about 41% of the variation in the points tally that the team will get in the second half of the season.

            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.42967    0.97671   9.655   <2e-16 ***
Midway       0.64126    0.03549  18.070   <2e-16 ***
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6.992 on 478 degrees of freedom
  (20 observations deleted due to missingness)
Multiple R-squared:  0.4059,    Adjusted R-squared:  0.4046 
F-statistic: 326.5 on 1 and 478 DF,  p-value: < 2.2e-16

The interesting thing is that the coefficient of the midway score is less than 1, which implies that teams’ performances at the end of the season (literally) regress to the mean.

55 points at the end of the first 19 games is projected to translate to 100 at the end of the season. In fact, based on this regression model run on the first 19 games of the season, Liverpool should win the title by a canter.
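For anyone who wants to check the projection, here is a small sketch using the coefficients from the output above. It assumes (as the text and the 55-to-100 projection suggest) that the regression predicts second-half points from first-half points, so the full-season figure is the midway tally plus the predicted second half.

```python
# Coefficients copied from the regression output above.
INTERCEPT = 9.42967
SLOPE = 0.64126


def project_second_half(midway_points: float) -> float:
    """Predicted points in games 20 to 38, given points after 19 games."""
    return INTERCEPT + SLOPE * midway_points


def project_full_season(midway_points: float) -> float:
    """Midway points plus the predicted second-half points."""
    return midway_points + project_second_half(midway_points)


# Midway tally at which a team is projected to exactly repeat its first half.
# Above this, the projected second half is worse than the first (regression to the mean).
break_even = INTERCEPT / (1 - SLOPE)

print(round(project_full_season(55), 1))  # 99.7, i.e. about 100
print(round(break_even, 1))               # about 26.3
```

In this model, any team with more than about 26 points at the halfway stage is projected to do a little worse in the second half, and any team below that a little better.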

PS: Look at the bottom of this projections table. It seems like for the first time in a very long time, the “magical” 40 points might be necessary to stave off relegation. Then again, it’s regression (pun intended).

Should I tweet at all?

This is not a rhetorical question.

I was doing some random data analysis today. I downloaded an archive of all my tweets, and of all my blog posts, and was looking at some simple statistics. I won’t bore you with a lot of the mundane details.

One thing that I must mention is that the hypothesis that twitter activity has an adverse impact on my blogging is disproved. I was looking at the number of words I’ve put on twitter each week and the number of words I’ve blogged in the same week. The two are uncorrelated.
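For the curious, this is roughly how such a check can be done. It is a sketch, not the code I actually ran: it assumes the tweets and blog posts are available as (date, text) pairs, counts words by splitting on whitespace, groups them by ISO week, and computes a plain Pearson correlation. The function names and the tiny data set at the bottom are made up for illustration.

```python
from collections import Counter
from datetime import date
from math import sqrt


def weekly_words(items):
    """Total words per ISO (year, week) for an iterable of (date, text) pairs."""
    totals = Counter()
    for day, text in items:
        year, week, _ = day.isocalendar()
        totals[(year, week)] += len(text.split())
    return totals


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) if either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def weekly_correlation(tweets, posts):
    """Correlate words tweeted and words blogged, week by week."""
    tweet_words, post_words = weekly_words(tweets), weekly_words(posts)
    weeks = sorted(set(tweet_words) | set(post_words))  # a missing week counts as zero words
    return pearson([tweet_words[w] for w in weeks], [post_words[w] for w in weeks])


# Made-up data: three consecutive weeks in which both counts rise together.
toy_tweets = [(date(2020, 1, 6), "a b"), (date(2020, 1, 13), "a b c d"), (date(2020, 1, 20), "a b c d e f")]
toy_posts = [(date(2020, 1, 6), "x"), (date(2020, 1, 13), "x y"), (date(2020, 1, 20), "x y z")]
print(weekly_correlation(toy_tweets, toy_posts))
```

A value close to zero on the real archives is what “uncorrelated” means here.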


In any case, so far I’ve tweeted 60,716 tweets over the course of eleven and a half years. My tweets include a total of 992,453 words. Ignoring other handles, links and punctuation, maybe we can round this down to 950,000. In other words, in nine and a half years I’ve tweeted nine and a half hundred thousand words.

Or that on twitter alone I write a hundred thousand words a year. 

The content of my book was about 52,000 words (IIRC). In other words, I write enough content for two books EACH YEAR on twitter. In 2013, I tweeted nearly four books worth of content.

That, however, is not the only reason why I wonder if I should tweet at all. While I’ve discovered a lot of interesting people, and made interesting connections, and can “semi-keep-in-touch” with people through twitter, I’m not really sure about the “impact” of my tweets.

I thought I’d look at the tweets that have been most retweeted.

Tweet | Date | Retweets | Likes
Why does the government / ruling party put out tweets with basic arithmetic errors? ₹14.98+₹9.02 is ₹24 not ₹27.44 | 2017-09-19 | 350 | 416
Remember that Richter scale is logarithmic. Base 10 if I’m not wrong. So 7.7 is 10 times as bad as 6.7 | 2015-04-25 | 171 | 40
Our @uber driver tonight was one Mr Akmal. He dropped us successfully. | 2017-12-24 | 148 | 312
The greatest Hindi movie about Rajputs is Jaane Tu Ya Jaane Na | 2017-11-24 | 142 | 299
Based on interim data, in 17 states NOTA has got more votes than AAP. #MintElections #MeaninglessComparisons | 2014-05-16 | 134 | 19
A whopping 332 out of 542 constituencies in the just-concluded General Elections saw a two-way contest. Another 184 saw three-way contests. In contrast, in the 2014 elections only 169 two-way contests, 278 three-cornered contests and 90 four-cornered contests | 2019-05-24 | 95 | 174
“these dark days” is a euphemism for “people I didn’t vote for have formed the government” | 2019-12-19 | 69 | 242
I have built this app that recommends single malt whiskies based on what you already like. Details here: | 2018-11-02 | 56 | 202
Amazing number of commies on this list RT @suar4sure: “@BookLuster: Which dictator killed the most People?” | 2014-07-23 | 44 | 13
If BJP hadn’t split, numbers would have been: Cong: 91, BJP: 86, JDS: 35 @gkjohn | 2013-05-08 | 39 | 3
Today @moneycontrolcom / @CNNnews18 have unleashed a monstrosity of a map. The map explains nothing, and nothing can explain the map! | 2019-02-21 | 36 | 80
Stud thread | 2018-07-22 | 34 | 105
there’s one piece of @ShashiTharoor ‘s writing that I’ve read multiple times – his blurb for my book. When I first read it, I was amazed at how precisely it communicated the idea of my book – much better than I’d ever managed to do. | 2017-12-14 | 33 | 138
Did you know the cube root law of assembly size? It’s a heuristic that states that the optimal size of a national assembly is the cube root of the population | 2019-12-13 | 27 | 80
Great piece by Dheeraj Sanghi on the expulsion of students from IIT Roorkee: | 2015-07-12 | 27 | 11
did the Chinese workers use One Belt to beat up the police, and then escape on One Road? | 2018-04-05 | 26 | 98
I don’t know why people don’t get that a non-zero number that ends in zero is even. This is absolutely bizarre | 2015-12-07 | 26 | 13
the one thing AAP has succeeded in doing is to tremendously increase my respect for LokSatta and @JP_LOKSATTA | 2014-05-14 | 25 | 16
Coffee truck at avenue road. By coffee board voluntarily retired employees association. Brilliant coffee. Ten bucks. | 2013-07-11 | 24 | 6
And I present you the way the parliamentary constituencies in Bangalore are demarcated | 2019-03-26 | 24 | 59

Till date, I’ve had FIVE tweets with more than a hundred retweets. I’ve had ELEVEN tweets with more than a hundred likes (including one where I’ve simply said “stud thread” and then linked to a thread written by someone else).

In other words, while I might have four thousand odd followers, the amplification of my tweets is rather minimal.

So maybe I should not tweet at all? And instead devote the time and effort spent in tweeting to other means? Maybe write another book instead?

What do you think?

Schelling segregation on High Streets

We’ve spoken about Thomas Schelling’s segregation model here before. The basic idea is this – people move houses if not enough people like them live around them. A simple rule is – if fewer than 3 of your 8 neighbours are like you, you move.

And Schelling’s insight was that even such a simple rule – that you only need more than a third of neighbours like yourself to stay in your place – when applied system wide, can quickly result in near-complete segregation.

I had done a quick simulation of Schelling’s model a few years back, and here is a picture from that
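The code behind that simulation isn’t reproduced here, but a minimal sketch of the model looks something like this. The assumptions are mine: a square grid that wraps around at the edges, two types of households plus some empty cells, and the “more than a third” rule applied to occupied neighbours only, with unhappy households moving to a random empty cell.

```python
import random

EMPTY = 0


def make_grid(size, empty_share, rng):
    """Random grid of household types 1 and 2, with some cells left empty."""
    def cell():
        return EMPTY if rng.random() < empty_share else rng.choice((1, 2))
    return [[cell() for _ in range(size)] for _ in range(size)]


def like_share(grid, r, c):
    """Share of occupied neighbours (8 surrounding cells) that match the household at (r, c)."""
    size, me = len(grid), grid[r][c]
    same = occupied = 0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            other = grid[(r + dr) % size][(c + dc) % size]
            if other != EMPTY:
                occupied += 1
                same += other == me
    return same / occupied if occupied else 1.0  # nobody around: nothing to object to


def mean_like_share(grid):
    """Average like-neighbour share over all households: a crude segregation measure."""
    size = len(grid)
    shares = [like_share(grid, r, c)
              for r in range(size) for c in range(size) if grid[r][c] != EMPTY]
    return sum(shares) / len(shares)


def step(grid, rng, threshold=1 / 3):
    """Move households that were unhappy at the start of the round; return how many there were."""
    size = len(grid)
    cells = [(r, c) for r in range(size) for c in range(size)]
    unhappy = [(r, c) for r, c in cells
               if grid[r][c] != EMPTY and like_share(grid, r, c) <= threshold]
    empties = [(r, c) for r, c in cells if grid[r][c] == EMPTY]
    rng.shuffle(unhappy)
    for r, c in unhappy:
        if not empties:
            break
        i = rng.randrange(len(empties))
        new_r, new_c = empties[i]
        grid[new_r][new_c], grid[r][c] = grid[r][c], EMPTY
        empties[i] = (r, c)  # the vacated cell becomes available
    return len(unhappy)


def simulate(size=20, empty_share=0.1, rounds=50, seed=1):
    """Run until nobody wants to move (or rounds run out); return the measure before and after."""
    rng = random.Random(seed)
    grid = make_grid(size, empty_share, rng)
    before = mean_like_share(grid)
    for _ in range(rounds):
        if step(grid, rng) == 0:
            break
    return before, mean_like_share(grid)


before, after = simulate()
print(f"average share of like neighbours: {before:.2f} before, {after:.2f} after")
```

The number to watch is the average share of like neighbours: it starts at around a half on a random grid, and the interesting part is how far above the one-third threshold it ends up.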

Of late I’ve started noticing this in retail as well. The operative phrase in the previous sentence is “I’ve started noticing”, for I think there is nothing new about this phenomenon.

Essentially retail outlets want to be located close to other stores that belong to the same category, or at least the same segment. One piece of rationale here is spillovers – someone who comes to a Louis Philippe store, upon not finding what they want, might want to hop over to the Arrow store next door. And then to the Woodland store across the road to buy shoes. And so on.

When a store is located with stores selling stuff targeted at a disjoint market, this spillover is lost.

And then there is the branding issue. A store that is located along with more downmarket stores risks losing its own brand value. This is one reason you see, across time, malls becoming segmented by the kind of stores they have.

A year and a half back, I’d written about how the Jayanagar Shopping Complex “died”, thanks to non-increase of rents which resulted in cheap shops taking over, resulting in all the nicer shops moving out. In that I’d written:

On the other hand, the area immediately around the now-dying shopping complex has emerged as a brilliant retail destination.

And now I see this Schelling-ian game playing out in the area around the Jayanagar Shopping Complex. This is especially visible on two roads that attract a lot of shoppers – 11th main and 30th cross (which intersect at the Cool Joint junction).

These are two roads that have historically had a lot of good branded stores, but the way they’ve developed in the last year or so is interesting.

I don’t know if it has to do with drainage works that have been taking forever, but 32nd Cross seems to be moving more and more downmarket. A Woodland’s shoe store moved out. As did a Peter England store. Shree Sagar, which once served excellent chaats, now looks desolate.

The road has instead been taken over by stores selling “export reject garments” and knock down brands. And as I’ve observed over the last few months, these kinds of shops continue to take over more and more of the retail space on that road. In that sense, it is surprising that a new Jockey store took over three floors of a building on that road – seems completely out of character there. I expect it to move in short order.

I must mention here that over the last few years, the supply of retail space in Jayanagar has exploded, and that has automatically meant that all kinds of brands have space to operate there. It was only natural that a process takes place where certain roads become more upmarket than others.

Nevertheless, the way 30th cross (between 10th and 11th mains) and 10th main have visibly evolved over the last year or so is rather interesting.

Video Geographic Monopolies

There is one quirk about video which we don’t face with print – some content is simply impossible to access legally in some parts of the world.

I’m specifically talking about BBC’s Match Of The Day, their end of day highlights package covering the English Premier League. It was one show that I watched unfailingly during my time in London, both for the match highlights, and for the quality of the discussion featuring Gary Lineker, Alan Shearer, Ian Wright et al.

Now I find that the show is simply not available in India – some youtube channels illegally offer the show (before they are taken down, I guess), but without the bits that show pictures of the game (which they are not allowed to show). And that makes for rather painful watching, knowing that you’re watching something substandard.

This is not the case with something like text – as long as I’m willing to pay, I’m able to access content produced anywhere in the world. I can sit here in Bangalore and buy a subscription to the New York Times, and access all its content. Audio is also similar – I can sit here and subscribe to any international podcast, and be able to access the content.

Video doesn’t work that way. The problem is with the way rights are sold – the Star network, for example, has a monopoly on showing pictures from the Premier League in India (having paid a substantial amount for it). And part of their arrangement means that nobody else is allowed to broadcast this material in India. A consequence of this is that we are stuck with whatever (mostly crappy) analysis Star decides to provide around its games. Stuff that is unwatchable.

There is a lot of great sport content online, but the video part is constrained by the inability to show pictures. Check out analysis by Tifo Football, for example – it’s absolutely top class. However, for most games, they have to rely on stock images and block diagrams since they can’t show the pictures which someone has a monopoly on. And that makes the analysis less rich (the Athletic, which I have a subscription to, “solves” this in an interesting way – by using screenshots of the TV footage of the game as part of their text analysis).

I wonder if there is a way out of this. Some leagues such as the NBA have shown some enlightened thinking on this – while they are anal about copyright of their live feed, they don’t care about copyrights on recorded footage. This means that anyone can use footage from historical NBA games as part of their analysis. Better analysis means more people interested in the sport, which means more people watching the live feed, which makes more money for the league (read this excellent interview of NBA Commissioner Adam Silver).

I’m also beginning to wonder if there is a regulatory antitrust response to this issue. Video distribution (especially of live content) is a natural monopoly, so it doesn’t make sense to have competing broadcasters. However, I wonder if there is any regulation possible for historical feeds that makes them more tradable (with the rights holders getting appropriately compensated without high transaction costs)!

One can only hope..

Evolution of sports broadcasting

I had a pleasant surprise yesterday morning when I was watching the highlights of Liverpool’s 4-0 victory at Leicester. The picture quality suddenly looked better. The production aesthetics in the first few seconds (before coverage of the actual match began) looked “American”. I doubted myself for a minute if this was actually English football I was watching.

And then I remembered that the pictures for this game came from Amazon Prime. The streaming service had got rights to broadcast two full rounds of Premier League games this season, making a small chink in the duopoly of Sky Sports and BT Sport.

Traditional media wasn’t too impressed by it. Streaming necessarily meant a small delay in broadcast, and that made it less exciting for some viewers. The Guardian predictably made a noise about the “corporate takeover” with Amazon’s entry. From all the reports I read (mostly across the Guardian and the Athletic), commentators seemed intent on picking holes in Amazon’s performance.

That said, the new broadcaster also brought a fresh production aesthetic. While there were the inevitable teething problems (I must confess I didn’t watch these games live – being midweek evening games, they were very late night in India), Amazon for sure brought some new ideas into the broadcast.

Just like Fox Sports had done when it made a big launch into NFL broadcasting in the early 90s. Read this oral history of that episode. It’s rather fantastic. Among the “innovations” that Fox Sports brought into American broadcasting (based on its sports broadcast in Australia, primarily) was this box at the corner showing the time and the live score. The thing wasn’t initially well received, but is now a fixture.

For evolution to happen, you need sex. And that means mixing things up, in ways they weren’t mixed before. If we were all the children of a super-god and a super-goddess, we would all be pretty much the same since the amount of “innovation” that could happen would be limited. And things would be boring, and static. Complex forms such as human beings could have never happened.

It is similar in business, and sports broadcasting, as well. When you have the same channels covering the same sports, they get into well-set local optima, and nothing new is tried. There is no necessity for improvement in that sense.

When new players come in, preferably from another market, however, they see the need to differentiate themselves, and bring in ideas from their former market. And this leads to a crossover of ideas. In their efforts to stand out and make an impact, they might also bring in some ideas never seen anywhere – “mutations” in the evolutionary sense.

A lot of them don’t make sense and they die out. Others score unexpected hits and catch on. And that way, this memetic evolution leads to better business.

The great thing about memetic evolution is that while bad ideas come along much more often than good ideas, they get discarded fairly quickly, while the good ideas live on. And that leads to overall better products.

Right now in India we have a duopoly in sports broadcasting, controlled by the Star family and the Sony family. I’ve ranted several times about how the latter is absolutely atrocious and does nothing to improve the game. Hopefully a new player getting rights of some sport here will shake things up and bring in fresh ideas. Even if some of the ideas turn out to be bad, there will be plenty of good ideas.

Check out the highlights of the Leicester-Liverpool game, and you’ll get an idea.

Who drives the Sleighs?

I admit I don’t know too many Christmas songs. I mean, I can recognise the tunes of quite a few, but there are very few of which I know the lyrics. This is on account of us mainly trolling our school music teacher Samson when he would teach us these songs at this time of the year every year. For example, one year I remember we would sing all songs replacing keywords with “moTTe” (egg). “Joy to the world, the moTTe has come” etc.

So two songs whose lyrics I sort of know (primarily because the wife knows their lyrics well, and also because the daughter sings well), are Jingle Bells and Rudolph The Red Nosed Reindeer. And the two fundamentally contradict.

Dashing through the snow
In a one-horse open sleigh
Over the fields we go
Laughing all the way

So it’s a “one-horse open sleigh”, which suggests that Santa travels on a sleigh which is pulled by one animal, which happens to be a horse.

Now, consider the lyrics of the other song that I know:

Then one foggy Christmas Eve,
Santa came to say,
Rudolph with your nose so bright,
Won’t you guide my sleigh tonight

So this suggests that it is Rudolph the red-nosed reindeer who pulls the sleigh?

Isn’t there a fundamental contradiction in here? Is it the horse that pulls the sleigh or is it the reindeer? It’s extremely confusing!

I have only one explanation to this – it’s a one-horse sleigh that Rudolph the Reindeer “guides”. So you can think of Santa sitting on a sleigh that is pulled by a horse, with the reindeer travelling along as a guide of sorts (since the horse doesn’t know its way around Scandinavia).

What’s your explanation of it?

And while we are at it, wish you a Merry Christmas!