## Statistical analysis revisited – machine learning edition

Over ten years ago, I wrote this blog post that I had termed as a “lazy post” – it was an email that I’d written to a mailing list, which I’d then copied onto the blog. It was triggered by someone on the group making an off-hand comment of “doing regression analysis”, and I had set off on a rant about why the misuse of statistics was a massive problem.

Ten years on, I find the post to be quite relevant, except that instead of “statistics”, you just need to say “machine learning” or “data science”. So this is a truly lazy post, where I piggyback on my old post, to talk about the problems with indiscriminate use of data and models.

there is this popular view that if there is data, then one ought to do statistical analysis, and draw conclusions from that, and make decisions based on these conclusions. unfortunately, in a large number of cases, the analysis ends up being done by someone who is not very proficient with statistics and who is basically applying formulae rather than using a concept. as long as you are using statistics as concepts, and not as formulae, I think you are fine. but you get into the “ok i see a time series here. let me put regression. never mind the significance levels or stationarity or any other such blah blah but i’ll take decisions based on my regression” then you are likely to get into trouble.

The modern version of this is – everybody wants to do “big data” and “data science”. So if there is some data out there, people will want to draw insights from it. And since it is easy to apply machine learning models (thanks to open source toolkits such as the scikit-learn package in Python), people who don’t understand the models indiscriminately apply it on the data that they have got. So you have people who don’t really understand data or machine learning working with those, and creating models that are dangerous.

As long as people have idea of the models they are using, and the assumptions behind them, and the quality of data that goes into the models, we are fine. However, we are increasingly seeing cases of people using improper or biased data and applying models they don’t understand on top of them, that will have impact that affect the wider world.

So the problem is not with “artificial intelligence” or “machine learning” or “big data” or “data science” or “statistics”. It is with the people who use them incorrectly.

## Big Data and Fast Frugal Trees

In his excellent podcast episode with EconTalk’s Russ Roberts, psychologist Gerd Gigerenzer introduces the concept of “fast and frugal trees“. When someone needs to make decisions quickly, Gigerenzer says, they don’t take into account a large number of factors, but instead rely on a small set of thumb rules.

The podcast itself is based on Gigerenzer’s 2009 book Gut Feelings. Based on how awesome the podcast was, I read the book, but found that it didn’t offer too much more than what the podcast itself had to offer.

Coming back to fast and frugal trees..

In recent times, ever since “big data” became a “thing” in the early 2010s, it is popular for companies to tout the complexity of their decision algorithms, and machine learning systems. An easy way for companies to display this complexity is to talk about the number of variables they take into account while making a decision.

For example, you can have “fin-tech” lenders who claim to use “thousands of data points” on their prospective customers’ histories to determine whether to give out a loan. A similar number of data points is used to evaluate resumes and determine if a candidate should be called for an interview.

With cheap data storage and compute power, it has become rather fashionable to “use all the data available” and build complex machine learning models (which aren’t that complex to build) for decisions that were earlier made by humans. The problem with this is that this can sometimes result in over-fitting (system learning something that it shouldn’t be learning) which can lead to disastrous predictive power.

In his podcast, Gigerenzer talks about fast and frugal trees, and says that humans in general don’t use too many data points to make their decisions. Instead, for each decision, they build a quick “fast and frugal tree” and make their decision based on their gut feelings about a small number of data points. What data points to use is determined primarily based on their experience (not cow-like experience), and can vary by person and situation.

The advantage of fast and frugal trees is that the model is simple, and so has little scope for overfitting. Moreover, as the name describes, the decision process is rather “fast”, and you don’t have to collect all possible data points before you make a decision. The problem with productionising the fast and frugal tree, however, is that each user’s decision making process is different, and about how we can learn that decision making process to make the most optimal decisions at a personalised level.

How you can learn someone’s decision-making process (when you’ve assumed it’s a fast and frugal tree) is not trivial, but if you can figure it out, then you can build significantly superior recommender systems.

If you’re Netflix, for example, you might figure that someone makes their movie choices based only on age of movie and its IMDB score. So their screen is customised to show just these two parameters. Someone else might be making their decisions based on who the lead actors are, and they need to be shown that information along with the recommendations.

Another book I read recently was Todd Rose’s The End of Average. The book makes the powerful point that nobody really is average, especially when you’re looking a large number of dimensions, so designing for average means you’re designing for nobody.

I imagine that is one reason why a lot of recommender systems (Netflix or Amazon or Tinder) fail is that they model for the average, building one massive machine learning system, rather than learning each person’s fast and frugal tree.

The latter isn’t easy, but if it can be done, it can result in a significantly superior user experience!

## Yet another “big data whisky”

A long time back I had used a primitive version of my Single Malt recommendation app to determine that I’d like Ardbeg. Presently, the wife was travelling to India from abroad, and she got me a bottle. We loved it.

And so I had screenshots from my app stored on my phone all the time, to be used while at duty frees, so I would know what whiskies to buy.

And then about a year back, we started planning a visit to Scotland. If you remember, we were living in London then, and my wife’s cousin and her family were going to visit us over Christmas. And the plan was to go to the Scottish Highlands for a few days. And that had to include a distillery tour.

Out came my app again, to determine which distillery to visit. I had made a scatter plot (which I have unfortunately lost since) with the distance from Inverness (where we were going to be based) on one axis, and the likelihood of my wife and I liking a whisky (based on my app) on the other (by this time, Ardbeg was firmly in the “calibration set”).

The clear winner was Clynelish – it was barely 100 kilometers away from Inverness, promised a nice drive there, and had a very high similarity score to the stuff that we liked. I presently called them to make a booking for a distillery tour. The only problem was that it’s a Diageo distillery, and Diageo distillery doesn’t allow kids inside (we were travelling with three of them).

I was proud of having planned my vacation “using data science”. I had made up a blog post in my head that I was going to write after the vacation. I was basically picturing “turning around to the umpire and shouting ‘howzzat'”. And then my hopes were dashed.

A week after I had made the booking, I got a call back from the distillery informing me that it was unfortunately going to be closed during our vacation, and so we couldn’t visit. My heart sank. We finally had to make do with two distilleries that were pretty close to Inverness, but which didn’t rate highly according to my app.

My cousin-in-law-in-law and I first visited Glen Ord, another Diageo distillery, leaving our wives and kids back in the hotel. The tour was nice, but the whisky at the distillery was rather underwhelming. The high point was the fact that Glen Ord also supplies highly peated malt to other Diageo distilleries such as Clynelish (which we couldn’t visit) and Talisker (one of my early favourites).

A day later, we went to the more family friendly Tomatin distillery, to the south of Inverness (so we could carry my daughter along for the tour. She seemed to enjoy it. The other kids were asleep in the car with their dad). The tour seemed better there, but their flagship whisky seemed flat. And then came Cu Bocan, a highly peated whisky that they produce in very limited quantities and distribute in a limited fashion.

Initially we didn’t feel anything, but then the “smoke hit from the back”. Basically the initial taste of the whisky was smooth, but as you swallowed it, the peat would hit you. It was incredibly surreal stuff. We sat at the distillery’s bar for a while downing glasses full of Cu Bocan.

The cousin-in-law-in-law quickly bought a bottle to take back to Singapore. We dithered, reasoning we could “use Amazon to deliver it to our home in London”. The muhurta for the latter never arrived, and a few months later we were on our way to India. Travelling with six suitcases and six handbags and a kid meant that we were never going to buy duty free stuff on our way home (not that Cu Bocan was available in duty free).

In any case, Clynelish is also not widely available in duty free shops, so we couldn’t have that as well for a long time. And then we found an incredibly well stocked duty free shop in Maldives, on our way back from our vacation there in August. A bottle was duly bought.

And today the auspicious event arrived for the bottle to be opened. And it’s spectacular. A very different kind of peat than Lagavulin (a bottle of which we just finished yesterday). This one hits the mouth from both the front and the back.

And I would like to call Clynelish the “new big data whisky”, having discovered it through my app, almost going there for a distillery tour, and finally tasting it a year later.

Highly recommended! And I’d highly recommend my app as well!

Cheers!

## Segmentation and machine learning

For best results, use machine learning to do customer segmentation, but then get humans with domain knowledge to validate the segments

There are two common ways in which people do customer segmentation. The “traditional” method is to manually define the axes through which the customers will get segmented, and then simply look through the data to find the characteristics and size of each segment.

Then there is the “data science” way of doing it, which is to ignore all intuition, and simply use some method such as K-means clustering and “do gymnastics” with the data and find the clusters.

A quantitative extreme of this method is to do gymnastics with your data, get segments out of it, and quantitatively “take action” on it without really bothering to figure out what each clusters represent. Loosely speaking, this is how a lot of recommendation systems nowadays work – some algorithm somewhere finds people similar to you based on your behaviour, and recommends to you what they liked.

I usually prefer a sort of middle ground. I like to let the algorithms (k-means easily being my favourite) to come up with the segments based on the data, and then have a bunch of humans look at the segments and make sense of it.

Basically whatever segments are thrown up by the algorithm need to be validated by human intuition. Getting counterintuitive clusters is also not a problem – on several occasions, people I’ve validated the clusters by (usually clients) have used the counterintuitive clusters to discover bugs, gaps in the data  or patterns that they didn’t know of earlier.

Also, in terms of validation of clusters, it is always useful to get people with domain knowledge to validate the clusters. And this also means that whatever clusters you’ve generated you are able to represent them in a human-readable format. The best way of doing that is to use the cluster centres and then represent them somehow in a “physical” manner.

I started writing this post some three days ago and am only getting to finish it now. Unfortunately, in the meantime I’ve forgotten the exact motivation of why I started writing this. If i recall that, I’ll maybe do another post.

## Taking Intelligence For Granted

There was a point in time when the use of artificial intelligence or machine learning or any other kind of intelligence in a product was a source of competitive advantage and differentiation. Nowadays, however, many people have got so spoiled by the use of intelligence in many products they use that it has become more of a hygiene factor.

Take this morning’s post, for example. One way to look at it is that Spotify with its customisation algorithms and recommendations has spoiled me so much that I find Amazon’s pushing of Indian music irritating (Amazon’s approach can be called as “naive customisation”, where they push Indian music to me only because I’m based in India, and not learn further based on my preferences).

Had I not been exposed to the more intelligent customisation that Spotify offers, I might have found Amazon’s naive customisation interesting. However, Spotify’s degree of customisation has spoilt me so much that Amazon is simply inadequate.

This expectation of intelligence goes beyond product and service classes. When we get used to Spotify recommending music we like based on our preferences, we hold Netflix’s recommendation algorithm to a higher standard. We question why the Flipkart homepage is not customised to us based on our previous shopping. Or why Google Maps doesn’t learn that some of us don’t like driving through small roads when we can help it.

That customers take intelligence for granted nowadays means that businesses have to invest more in offering this intelligence. Easy-to-use data analysis and machine learning packages mean that at least some part of an industry uses intelligence in at least some form (even if they might do it badly in case they fail to throw human intelligence into the mix!).

So if you are in the business of selling to end customers, keep in mind that they are used to seeing intelligence everywhere around them, and whether they state it or not, they expect it from you.

## More on statistics and machine learning

I’m thinking of a client problem right now, and I thought that something that we need to predict can be modelled as a function of a few other things that we will know.

Initially I was thinking about it from the machine learning perspective, and my thought process went “this can be modelled as a function of X, Y and Z. Once this is modelled, then we can use X, Y and Z to predict this going forward”.

And then a minute later I context switched into the statistical way of thinking. And now my thinking went “I think this can be modelled as a function of X, Y and Z. Let me build a quick model to see if the goodness of fit, and whether a signal actually exists”.

Now this might reflect my own biases, and my own processes for learning to do statistics and machine learning, but one important difference I find is that in statistics you are concerned about the goodness of fit, and whether there is a “signal” at all.

While in machine learning as well we look at what the predictive ability is (area under ROC curve and all that), there is a bit of delay in the process between the time we model and the time we look for the goodness of fit. What this means is that sometimes we can get a bit too certain about the models that we want to build without thinking if in the first place they make sense and there’s a signal in that.

For example, in the machine learning world, the concept of R Square is not defined for regression –  the only thing that matters is how well you can predict out of sample. So while you’re building the regression (machine learning) model, you don’t have immediate feedback on what to include and what to exclude and whether there is a signal.

I must remind you that machine learning methods are typically used when we are dealing with really high dimensional data, and where the signal usually exists in the interplay between explanatory variables rather than in a single explanatory variable. Statistics, on the other hand, is used more for low dimensional problems where each variable has reasonable predictive power by itself.

It is possibly a quirk of how the two disciplines are practiced that statistics people are inherently more sceptical about the existence of signal, and machine learning guys are more certain that their model makes sense.

What do you think?

## Data, football and astrology

Jonathan Wilson has an amusing article on data and football, and how many data-oriented managers in football have also been incredibly superstitious.

This is in response to BT Sport’s (one of the UK broadcasters of the Premier League) announcement of it’s “Unscripted” promotion where “some of the world’s foremost experts in both sports and artificial intelligence to produce a groundbreaking prophecy of the forthcoming season”.

Wilson writes:

I was reminded also of the 1982 film adaptation of Agatha Christie’s 1939 novel Murder is Easy in which a computer scientist played by Bill Bixby enters the details of the case into a programme he has coded to give the name of the murderer. As it turns out, the programmer knows this is nonsense and is merely trying to gauge the reaction of the heroine, played by Lesley-Anne Down, when her name flashes on the screen.

But this, of course, is not what data-based analysis is for. Its predictive element deals in probability not prophecy. It is not possessed of some oracular genius. (That said, it is an intriguing metaphysical question: what if you had all the data, not just ability and fitness, but every detail of players’ diet, relationships and mental state, the angle of blades of grass on the pitch, an assessment of how the breathing of fans affected air flow in the stadium … would the game’s course then be inevitable?)

This reminded me of my own piece that I wrote last year about how data science “is simply the new astrology“.

## Periodicals and Dashboards

The purpose of a dashboard is to give you a live view of what is happening with the system. Take for example the instrument it is named after – the car dashboard. It tells you at the moment what the speed of the car is, along with other indicators such as which lights are on, the engine temperature, fuel levels, etc.

Not all reports, however, need to be dashboards. Some reports can be periodicals. These periodicals don’t tell you what’s happening at a moment, but give you a view of what happened in or at the end of a certain period. Think, for example, of classic periodicals such as newspapers or magazines, in contrast to online newspapers or magazines.

Periodicals tell you the state of a system at a certain point in time, and also give information of what happened to the system in the preceding time. So the financial daily, for example, tells you what the stock market closed at the previous day, and how the market had moved in the preceding day, month, year, etc.

Doing away with metaphors, business reporting can be classified into periodicals and dashboards. And they work exactly like their metaphorical counterparts. Periodical reports are produced periodically and tell you what happened in a certain period or point of time in the past. A good example are company financials – they produce an income statement and balance sheet to respectively describe what happened in a period and at a point in time for the company.

Once a periodical is produced, it is frozen in time for posterity. Another edition will be produced at the end of the next period, but it is a new edition. It adds to the earlier periodical rather than replacing it. Periodicals thus have historical value and because they are preserved they need to be designed more carefully.

Dashboards on the other hand are fleeting, and not usually preserved for posterity. They are on the other hand overwritten. So whether all systems are up this minute doesn’t matter a minute later if you haven’t reacted to the report this minute, and thus ceases to be of importance the next minute (of course there might be some aspects that might be important at the later date, and they will be captured in the next periodical).

When we are designing business reports and other “business intelligence systems” we need to be cognisant of whether we are producing a dashboard or a periodical. The fashion nowadays is to produce everything as a dashboard, perhaps because there are popular dashboarding tools available.

However, dashboards are expensive. For one, they need a constant connection to be maintained to the “system” (database or data warehouse or data lake or whatever other storage unit in the business report sense). Also, by definition they are not stored, and if you need to store then you have to decide upon a frequency of storage which makes it a periodical anyway.

So companies can save significantly on resources (compute and storage) by switching from dashboards (which everyone seems to think in terms of) to periodicals. The key here is to get the frequency of the periodical right – too frequent and people will get bugged. Not frequent enough, and people will get bugged again due to lack of information. Given the tools and technologies at hand, we can even make reports “on demand” (for stuff not used by too many people).

## Telling stories with data

I’m about 20% through with The Verdict by Prannoy Roy and Dorab Sopariwala. It’s a fascinating book, except for one annoyance – it is full of tables that serve no purpose but to break the flow of text.

I must mention that I’m reading the book on the Kindle, which means that the tables can pose a major annoyance. Text breaks off midway through one page, and the next couple of pages involve a table or two, with several lines of text explaining what’s in the table. And then the text continues. It makes for a rather disruptive reading experience. And some of the tables have just one data point – making one wonder why it has been inserted there at all.

This is not the first book that I’ve noticed that makes this mistake. Some of the sports analytics books I’ve read in recent times, such as The Numbers Game also make the same error (I read that in print, and still had the same disruption). Bhagwati and Panagariya’s Why Growth Matters is similarly unreadable. Tables abruptly inserted into the middle of text, leading to the reader losing flow in the reading.

Telling a data story in book length is a completely different challenge to telling one in article length. And telling a story with data is a complete art form. When you’re putting a table there, you need to be able to explain why that table is important to the story – rather than putting it there just because it seems more rigorous.

Also the exact placement of the table (something that can’t be controlled well in Kindle, but is easy to fix in either HTML or print) matters –  the table should be relevant to the piece of text immediately preceding and succeeding it, in a way that it doesn’t disrupt the reader’s flow. More importantly, the table should be able to add value at that particular point – perhaps building on something that has been described in the previous paragraph.

Book length makes it harder because people don’t normally expect tables and figures to disturb their reading flow when reading something of book length. Also, the book format means that it is not always possible to insert a table at a precise point (even in print, where pagination is an issue).

So how do you tell a book length story with data? Firstly, be very stingy about the data that you want to show – anything that doesn’t immediately add value should be banished to the appendix. Even the rigour, which academics might be particular about, can be pushed to the end notes (not footnotes, since those can be disruptive to flow as well, turning pages into half pages).

Then, once you know that showing a particular table or graph is inevitable to telling the story, put it either in the beginning or the end of a chapter. This way, it doesn’t break the reader’s flow. Then, refer to individual numbers in the middle of the text without having to put the entire table in there. Unless each and every data point in the table is important, banish it to the endnotes.

One other common mistake (I did it in my piece in Forbes published yesterday) is to put a big table and not talk about it. It only seeks to confuse the reader, who starts looking for explanations for everything in the table in later parts.

I guess authors and analysts tend to get possessive. If you have worked hard to produce insights from data, you seek to share as much of it as possible. And this can mean simply dumping data all the data in the piece without a regard for what the reader will do with it.

I’m making a note to myself to not repeat this mistake in future.

## The problem with spider charts

On FiveThirtyEight, Nate Silver has a piece looking ahead to the Democratic primaries ahead of the presidential elections in the US next year. I don’t know enough about US politics to comment on the piece itself, but what caught my eye is the spider chart describing the various Democratic nominees.

This is a standard spider chart that people who read business news should recognise, so the appearance of such a chart isn’t big news. What bothers me, though, is that a respected data journalist like Nate Silver is publishing such charts, especially in an article under his own name. For spider charts do a lousy job of conveying information.

Implicitly, you might think that the area of the pentagon (in this case) thus formed conveys the strength of a particular candidate. Leaving aside the fact that the human eye can judge areas less well than lengths, the area of a spider chart accurately shows “strength” only in one corner case – where the values along all five axes are the same.

In all other cases, such as in the spider charts  above, the area of the pentagon (or whatever-gon) thus formed depends on the order in which the factors are placed. For example, in this chart, why should black voters be placed between the asian/hispanic and millennials? Why should party loyalists lie between the asian/hispanics and the left?

I may not have that much insight into US politics, but it should be fairly clear that the ordering of the factors in this case has no particular sanctity. You should be able to jumble up the order of the axes and the information in the chart should remain the same.

The spider chart doesn’t work this way. If lengths of the “semidiagonals” (the five axes on which we are measuring) are $l_1, l_2, ... l_n$, the area of the polygon thus formed equals $\frac{1}{2} sin \frac{360}{n} (l_1.l_2 + l_2.l_3 + ... + l_n.l_1)$. It is not hard to see that for any value of $n \ge 4$, the ordering of the “axes” makes a material difference in the area of the chart.

Moreover, in this particular case, with the legend being shown only with one politician, you need to keep looking back and forth to analyse where a particular candidate lies in terms of support among the five big democrat bases. Also, the representation suggests that these five bases have equal strength in the Democrat support base, while the reality may be far from it (again I don’t have domain knowledge).

Spider charts can look pretty, which might make them attractive for graphic designers. They are just not so good in conveying information.

PS: for this particular data set, I would just go with bars with small multiples (call me boring if you may). One set of bar graphs for each candidate, with consistent colour coding and ordering among the bars so that candidates can be compared easily.