Statistics and machine learning

So a group of statisticians (from Cyprus and Greece) have written an easy-to-read paper comparing statistical and machine learning methods in time series forecasting, and found that statistical methods do better, both in terms of accuracy and computational complexity.

To me, there’s no surprise in the conclusion, since in the statistical methods, there is some human intelligence involved, in terms of removing seasonality, making the time series stationary and then using statistical methods that have been built specifically for time series forecasting (including some incredibly simple stuff like exponential smoothing).

Machine learning methods, on the other hand, are more general purpose – the same neural networks used for forecasting these time series, with changed parameters, can be used for predicting something else.

In a way, using machine learning for time series forecasting is like using that little screwdriver from a Swiss army knife, rather than a proper screwdriver. Yes, it might do the job, but it’s in general inefficient and not an effective use of resources.

Yet, it is important that this paper has been written since the trend in industry nowadays has been that given cheap computing power, machine learning be used for pretty much any problem, irrespective of whether it is the most appropriate method for doing so. You also see the rise of “machine learning purists” who insist that no human intelligence should “contaminate” these models, and machines should do everything.

By pointing out that statistical techniques are superior at time series forecasting compared to general machine learning techniques, the authors bring to attention that using purpose-built techniques can actually do much better, and that we can build better systems by using a combination of human and machine intelligence.

They also helpfully include this nice picture that summarises what machine learning is good for, and I wholeheartedly agree: 

The paper also has some other gems. A few samples here:

Knowing that a certain sophisticated method is not as accurate as a much simpler one is upsetting from a scientific point of view as the former requires a great deal of academic expertise and ample computer time to be applied.

 

[…] the post-sample predictions of simple statistical methods were found to be at least as accurate as the sophisticated ones. This finding was furiously objected to by theoretical statisticians [76], who claimed that a simple method being a special case of e.g. ARIMA models, could not be more accurate than the ARIMA one, refusing to accept the empirical evidence proving the opposite.

 

A problem with the academic ML forecasting literature is that the majority of published studies provide forecasts and claim satisfactory accuracies without comparing them with simple statistical methods or even naive benchmarks. Doing so raises expectations that ML methods provide accurate predictions, but without any empirical proof that this is the case.

 

At present, the issue of uncertainty has not been included in the research agenda of the ML field, leaving a huge vacuum that must be filled as estimating the uncertainty in future predictions is as important as the forecasts themselves.

Beer and diapers: Netflix edition

When we started using Netflix last May, we created three personas for the three of us in the family – “Karthik”, “Priyanka” and “Berry”. At that time we didn’t realise that there was already a pre-created “kids” (subsequently renamed “children” – don’t know why that happened) persona there.

So while Priyanka and I mostly use our respective personas to consume Netflix (our interests in terms of video content hardly intersect), Berry uses both her profile and the kids profile for her stuff (of course, she’s too young to put it on herself. We do it for her). So over the year, the “Berry” profile has been mostly used to play Peppa Pig, and the occasional wildlife documentary.

Which is why we were shocked the other day to find that “Real life wife swap” had been recommended on her account. Yes, you read that right. We muttered a word of abuse about Netflix’s machine learning algorithms and since then have only used the “kids” profile to play Berry’s stuff.

Since then I’ve been wondering what made Netflix recommend “real life wife swap” to Berry. Surely, it would have been clear to Netflix that while it wasn’t officially classified as one, the Berry persona was a kid’s account? And even if it didn’t, didn’t the fact that the account was used for watching kids’ stuff lead the collaborative filtering algorithms at Netflix to recommend more kids’ stuff? I’ve come up with various hypotheses.

Since I’m not Netflix, and I don’t have their data, I can’t test it, but my favourite hypothesis so far involves what is possibly the most commonly cited example in retail analytics – “beer and diapers“. In this most-likely-apocryphal story, a supermarket chain discovered that beer and diapers were highly likely to appear together in shopping baskets. Correlation led to causation and a hypothesis was made that this was the result of tired fathers buying beer on their diaper shopping trips.

So the Netflix version of beer-and-diapers, which is my hypothesis, goes like this. Harrowed parents are pestered by their kids to play Peppa Pig and other kiddie stuff. The parents are so stressed that they don’t switch to the kid’s persona, and instead play Peppa Pig or whatever from their own accounts. The kid is happy and soon goes to bed. And then the parent decides to unwind by watching some raunchy stuff like “real life wife swap”.

Repeat this story in enough families, and you have a strong enough pattern that accounts not explicitly classified as “kids/children” have strong activity of both kiddie stuff and adult content. And when you use an account not explicitly mentioned as “kids” to watch kiddie stuff, it gets matched to these accounts that have created the pattern – Netflix effectively assumes that watching kid stuff on an adult account indicates that the same account is used to watch adult content as well. And so serves it to Berry!

Machine learning algorithms basically work on identifying patterns in data, and then fitting these patterns on hitherto unseen data. Sometimes the patterns make sense – like Google Photos identifying you even in your kiddie pics. Other times, the patterns are offensive – like the time Google Photos classified a black woman as a “gorilla“.

Thus what is necessary is some level of human oversight, to make sure that the patterns the machine has identified makes some sort of sense (machine learning purists say this is against the spirit of machine learning, since one of the purposes of machine learning is to discover patterns not perceptible to humans).

That kind of oversight at Netflix would have suggested that you can’t tag a profile to a “kiddie content AND adult content” category if the profile has been used to watch ONLY kiddie content (or ONLY adult content). And that kind of oversight would have also led Netflix to investigate issues of users using “general” account for their kids, and coming up with an algorithm to classify such accounts as kids’ accounts, and serve only kids’ content there.

It seems, though, that algorithms run supreme at Netflix, and so my baby daughter gets served “real life wife swap”. Again, this is all a hypothesis (real life wife swap being recommended is a fact, of course)!

Algorithmic curation

When I got my first smartphone (a Samsung Galaxy Note 2) in 2013, one of the first apps I installed on it was Flipboard. I’d seen the app while checking out some phones at either the Apple or Samsung retail outlets close to my home, and it seemed like a rather interesting idea.

For a long time, Flipboard was my go-to app to check the day’s news, as it conveniently categorised news into “tech”, “business” and “sport” and learnt about my preferences and fed me stuff I wanted. And then after some update, it suddenly stopped working – somehow it started serving too much stuff I didn’t want to read about, and when I tuned (by “following” and “unfollowing” topics) my feed, it progressively got worse.

I stopped using it some 2 years back, but out of curiosity started using it again recently. While it did throw up some nice articles, there is too much unwanted stuff in the app. More precisely, there’s a lot of “clickbaity” stuff (“10 things about Narendra Modi you would never want to know” and the like) in my feed, meaning I have to wade through a lot of such articles to find the occasional good ones.

(Aside: I dedicate about half a chapter to this phenomenon in my book. The technical term is “congestion”. I talk about it in the context of markets in relationships and real estate)

Flipboard is not the only one. I use this app called Pocket to bookmark long articles and read later. A couple of years back, Pocket started giving “recommendations” based on what I’d read and liked. Initially it was good, and mostly curated from what my “friends” on Pocket recommended. Now, increasingly I’m getting clickbaity stuff again.

I stopped using Facebook a long time before they recently redesigned their newsfeed (to give more weight to friends’ stuff than third party news), but I suspect that one of the reasons they made the change was the same – the feed was getting overwhelmed with clickbaity stuff, which people liked but didn’t really read.

Basically, there seems to be a widespread problem in a lot of automatically curated news feeds. To put it another way, the clickbaity websites seem to have done too well in terms of gaming whatever algorithms the likes of Facebook, Flipboard and Pocket use to build their automated recommendations.

And more worryingly, with all these curators starting to do badly around the same time (ok this is my empirical observation. Given few data points I might be wrong), it suggests that all automated curation algorithms use a very similar algorithm! And that can’t be a good thing.

Stirring the pile efficiently

Warning: This is a technical post, and involves some code, etc. 

As I’ve ranted a fair bit on this blog over the last year, a lot of “machine learning” in the industry can be described as “stirring the pile”. Regular readers of this blog will be familiar with this image from XKCD by now:

Source: https://xkcd.com/1838/

Basically people simply take datasets and apply all the machine learning techniques they have heard of (implementation is damn easy – scikit learn allows you to implement just about any model in three similar looking lines of code; See my code here to see how similar the implementation is).

So I thought I’ll help these pile-stirrers by giving some hints of what method to use for different kinds of data. I’ve over-simplified stuff, and so assume that:

  1. There are two predictor variables X and Y. The predicted variable “Z” is binary.
  2. X and Y are each drawn from a standard normal distribution.
  3. The predicted variable Z is “clean” – there is a region in the X-Y plane where Z is always “true” and another region where Z is always “false”
  4. So the idea is to see which machine learning techniques are good at identifying which kind of geometrical figures.
  5. Everything is done “in-sample”. Given the nature of the data, it doesn’t matter if we do it in-sample or out-of-sample.

For those that understand Python (and every pile-stirrer worth his salt is excellent at Python), I’ve put my code in a nice Jupyter Notebook, which can be found here.

So this is what the output looks like. The top row shows the “true values” of Z. Then we have a row for each of the techniques we’ve used, which shows how well these techniques can identify the pattern given in the top row (click on the image for full size).

As you can see, I’ve chosen some common geometrical shapes and seen which methods are good at identifying those. A few pertinent observations:

  1. Logistic regression and linear SVM are broadly similar, and both are shit for this kind of dataset. Being linear models, they fail to deal with non-linear patterns
  2. SVM with RBF kernel is better, but it fails when there are multiple “true regions” in the dataset. At least it’s good at figuring out some non-linear patterns. However, it can’t figure out the triangle or square – it draws curves around them, instead.
  3. Naive Bayesian (I’ve never understood this even though I’m pretty good at Bayesian statistics, but I understand this is a commonly used technique; and I’ve used default parameters so not sure how it is “Bayesian” even) can identify some stuff but does badly when there are disjoint regions where Z is true.
  4. Ensemble methods such as Random Forests and Gradient Boosting do rather well on all the given inputs. They do well for both polygons and curves. Elsewhere, Ada Boost mostly does well but trips up on the hyperbola.
  5. For some reason, Lasso fails to give an output (in the true spirit of pile-stirring, I didn’t explore why). Ridge is again a regression method and so does badly on this non-linear dataset
  6. Neural Networks (Multi Layer Perceptron to be precise) does reasonably well, but can’t figure out the sharp edges of the polygons.
  7. Decision trees again do rather well. I’m pleasantly surprised that they pick up and classify the disjoint sets (multi-circle and hyperbola) correctly. Maybe it’s the way scikit learn implements them?

Of course, the datasets that one comes across in real life are never such simple geometrical figures, but I hope that this set can give you some idea on what techniques to use where.

At least I hope that this makes you think about the suitability of different techniques for the data rather than simply applying all the techniques you know and then picking the one that performs best on your given training and test data.

That would count as nothing different from p-hacking, and there’s an XKCD for that as well!

Source: https://xkcd.com/882/

Machine learning and degrees of freedom

For starters, machine learning is not magic. It might appear like magic when you see Google Photos automatically tagging all your family members correctly, down to the day of their birth. It might appear so when Siri or Alexa give a perfect response to your request. And the way AlphaZero plays chess is almost human!

But no, machine learning is not magic. I’d made a detailed argument about that in the second edition of my newsletter (subscribe if you haven’t already!).

One way to think of it is that the output of a machine learning model (which could be anything from “does this picture contain a cat?” to “is the speaker speaking in English?”) is the result of a mathematical formula, whose parameters are unknown at the beginning of the exercise.

As the system gets “trained” (of late I’ve avoided using the word “training” in the context of machine learning, preferring to use “calibration” instead. But anyway…), the hitherto unknown parameters of the formula get adjusted in a manner that the formula output matches the given data. Once the system has “seen” enough data, we have a model, which can then be applied on unknown data (I’m completely simplifying it here).

The genius in machine learning comes in setting up mathematical formulae in a way that given input-output pairs of data can be used to adjust the parameters of the formulae. The genius in deep learning, which has been the rage this decade, for example, comes from a 30-year old mathematical breakthrough called “back propagation”. The reason it took until a few years back for it to become a “thing” has to do with data availability, and compute power (check this terrific piece in the MIT Tech Review about deep learning).

Within machine learning, the degree of complexity of a model can vary significantly. In an ordinary univariate least squares regression, for example, there are only two parameters the system can play with (slope and intercept of the regression line). Even a simple “shallow” neural network, on the other hand, has thousands of parameters.

Because a regression has so few parameters, the kind of patterns that the system can detect is rather limited (whatever you do, the system can only draw a line. Nothing more!). Thus, regression is applied only when you know that the relationship that exists is simple (and linear), or when you are trying to force-fit a linear model.

The upside of simple models such as regression is that because there are so few parameters to be adjusted, you need relatively few data points in order to adjust them to the required degree of accuracy.

As models get more and more complicated, the number of parameters increases, thus increasing the complexity of patterns that can be detected by the system. Close to one extreme, you have systems that see lots of current pictures of you and then identify you in your baby pictures.

Such complicated patterns can be identified because the system parameters have lots of degrees of freedom. The downside, of course, is that because the parameters start off having so much freedom, it takes that much more data to “tie them down”. The reason Google Photos can tag you in your baby pictures is partly down to the quantum of image data that Google has, which does an effective job of tying down the parameters. Google Translate similarly uses large repositories of multi-lingual text in order to “learn languages”.

Like most other things in life, machine learning also involves a tradeoff. It is possible for systems to identify complex patterns, but for that you need to start off with lots of “degrees of freedom”, and then use lots of data to tie down the variables. If your data is small, then you can only afford a small number of parameters, and that limits the complexity of patterns that can be detected.

One way around this, of course, is to use your own human intelligence as a pre-processing step in order to set up parameters in a way that they can be effectively tuned by data. Gopi had a nice post recently on “neat learning versus deep learning“, which is relevant in this context.

Finally, there is the issue of spurious correlations. Because machine learning systems are basically mathematical formulae designed to learn patterns from data, spurious correlations in the input dataset can lead to the system learning random things, which can hamper its predictive power.

Data sets, especially ones that have lots of dimensions, can display correlations that appear at random, but if the input dataset shows enough of these correlations, the system will “learn” them as a pattern, and try to use them in predictions. And the more complicated your model gets, the harder it is to know what it is doing, and thus the harder it is to identify these spurious correlations!

And the thing with having too many “free parameters” (lots of degrees of freedom but without enough data to tie down the parameters) is that these free parameters are especially susceptible to learning the spurious correlations – for they have no other job.

Thinking about it, after all, machine learning systems are not human!

Nested Ternary Operators

It’s nearly twenty years since I first learnt to code, and the choice of language (imposed by school, among others) then was C. One of the most fascinating things about C was what was simply called the “ternary operator”, which is kinda similar to the IF statement in Excel, ifelse statement in R and np.where statement in Python.

Basically the ternary operator consisted of a ‘?’ and a ‘:’. It was a statement that took the form of “if this then that else something else”. So, for example, if you had two variables a and b, and had to return the maximum of them, you could use the ternary operator to say a>b?a:b.

Soon I was attending programming contests, where there would be questions on debugging programs. These would inevitably contain one question on ternary operators. A few years later I started attending job interviews for software engineering positions. The ternary operator questions were still around, except that now it would be common to “nest” ternary operators (include one inside the other). It became a running joke that the only place you’d see nested ternary operators was in software engineering interviews.

The thing with the ternary operator is that while it allows you to write your program in fewer lines of code and make it seem more concise, it makes the code a lot less readable. This in turn makes it hard for people to understand your code, and thus makes it hard to debug. In that sense, using the operator while coding in C is not considered particularly good practice.

It’s nearly 2018 now, and C is not used that much nowadays, so the ternary operator, and the nested ternary operator, have made their exit – even from programming interviews if I’m not wrong. However, people still continue to maintain this practice of writing highly optimised code.

Now, every programmer who thinks he’s a good programmer likes to write efficient code. There’s this sense of elegance about code written in a rather elegant manner, using only a few lines. Sometimes such elegant code is also more efficient, speeding up computation and consuming less memory (think, for example, vectorised operations in R).

The problem, however, is that such elegance comes with a tradeoff with readability. The more optimised a piece of code is, the harder it is for someone else to understand it, and thus the harder it is to debug. And the more complicated the algorithm being coded, the worse it gets.

It makes me think that the reason all those ternary operators used to appear in those software engineering interviews (FYI I’ve never done a software engineering job) is to check if you’re able to read complicated code that others write!

AlphaZero defeats Stockfish: Quick thoughts

The big news of the day, as far as I’m concerned, is the victory of Google Deepmind’s AlphaZero over Stockfish, currently the highest rated chess engine. This comes barely months after Deepmind’s AlphaGo Zero had bested the earlier avatar of AlphaGo in the game of Go.

Like its Go version, the AlphaZero chess playing machine learnt using reinforcement learning (I remember doing a term paper on the concept back in 2003 but have mostly forgotten). Basically it wasn’t given any “training data”, but the machine trained itself on continuously playing with itself, with feedback given in each stage of learning helping it learn better.

After only about four hours of “training” (basically playing against itself and discovering moves), AlphaZero managed to record this victory in a 100-game match, winning 28 and losing none (the rest of the games were drawn).

There’s a sample game here on the Chess.com website and while this might be a biased sample (it’s likely that the AlphaZero engineers included the most spectacular games in their paper, from which this is taken), the way AlphaZero plays is vastly different from the way engines such as Stockfish have been playing.

I’m not that much of a chess expert (I “retired” from my playing career back in 1994), but the striking things for me from this game were

  • the move 7. d5 against the Queen’s Indian
  • The piece sacrifice a few moves later that was hard to see
  • AlphaZero’s consistent attempts until late in the game to avoid trading queens
  • The move Qh1 somewhere in the middle of the game

In a way (and being consistent with some of the themes of this blog), AlphaZero can be described as a “stud” chess machine, having taught itself to play based on feedback from games it’s already played (the way reinforcement learning broadly works is that actions that led to “good rewards” are incentivised in the next iteration, while those that led to “poor rewards” are penalised. The challenge in this case is to set up chess in a way that is conducive for a reinforcement learning system).

Engines such as StockFish, on the other hand, are absolute “fighters”. They get their “power” by brute force, by going down nearly all possible paths in the game several moves down. This is supplemented by analysis of millions of existing games of various levels which the engine “learns” from – among other things, it learns how to prune and prioritise the paths it searches on. StockFish is also fed a database of chess openings which it remembers and tries to play.

What is interesting is that AlphaZero has “discovered” some popular chess openings through the course of is self-learning. It is interesting to note that some popular openings such as the King’s Indian or French find little favour with this engine, while others such as the Queen’s Gambit or the Queen’s Indian find favour. This is a very interesting development in terms of opening theory itself.

Frequency of openings over time employed by AlphaZero in its “learning” phase. Image sourced from AlphaZero research paper.

In any case, my immediate concern from this development is how it will affect human chess. Over the last decade or two, engines such as stockfish have played a profound role in the development of chess, with current top players such as Magnus Carlsen or Sergey Karjakin having trained extensively with these engines.

The way top grandmasters play has seen a steady change in these years as they have ingested the ideas from engines such as StockFish. The game has become far more quiet and positional, as players seek to gain small advantages which steadily improves over the course of (long) games. This is consistent with the way the engines that players learn from play.

Based on the evidence of the one game I’ve seen of AlphaZero, it plays very differently from the existing engines. Based on this, it will be interesting to see how human players who train with AlphaZero based engines (or their clones) will change their game.

Maybe chess will turn back to being a bit more tactical than it’s been in the last decade? It’s hard to say right now!