Statistics and machine learning approaches

A couple of years back, I was part of a team that delivered a workshop in machine learning. Given my background, I had been asked to do a half-day session on Regression, and was told that the standard software package being used was the scikit-learn package in python.

Both the programming language and the package were new to me, so I dug around a few days before the workshop, trying to figure out regression. Despite my best efforts, I couldn’t locate how to find out the R^2. What some googling told me was surprising:

There exists no R type regression summary report in sklearn. The main reason is that sklearn is used for predictive modelling / machine learning and the evaluation criteria are based on performance on previously unseen data

As it happened, I requested the students at the workshop to install a package called statsmodels, which provides standard regression outputs. And then I proceeded to lecture to them on regression as I know it, including significance scores, p values, t statistics, multicollinearity and the likes. It was only much later was I to figure out that that is now how regression (and logistic regression) is done in the machine learning world.

In a statistical framework, the data sets in regression are typically “long” – you have a large number of data points, and a small number of variables. Putting it differently, we start off with a model with few degrees of freedom, and then “constrain” the variables with a large enough number of data points, so that if a signal exists, and it is in the right format (linear relationship and all that), we can pin it down effectively.

In a machine learning framework, it is common to run a regression where the number of data points is of the same order of magnitude as, or even smaller than the number of variables. Strictly speaking, such a problem is unbounded (there are too many degrees of freedom), and so regression is not well-defined. Instead, we rely upon “regularisation methods” to “tie down” the variables and (hopefully) produce a consistent solution.

Moreover, machine learning approaches are common to problems where individual predictor variables don’t have meaning. In this scenario, knowing whether a particular variable is significant or not is of no utility. Then, the signal in machine learning lies in the combination of variables, which means that multicollinearity (correlation between predictor variables) is not really a bad thing as it is in statistics. Variables not having meanings means that there are no correlations per se to be defined, and so machine learning models are harder to interpret, and are more likely to have hidden spurious correlations.

Also, when you have a small number of variables and a large number of data points, it is easy to get an “exact solution” for regression, which is what statistical methods use. In a machine learning framework with “wide” data, though, exact solutions are computationally infeasible, and so you need to use approximate algorithms such as gradient descent – which are common across ML techniques.

All in all, while statistics and machine learning might use techniques with the same name (“regression”, for example), they are both in theory and practice, very different ways to solve the problem. The important thing is to figure out the approach most suited for a particular problem, and use it accordingly.

Meaningful and meaningless variables (and correlations)

A number of data scientists I know like to go about their business in a domain-free manner. They make a conscious choice to not know anything about the domain in which they are solving the problem, and instead treat a dataset as just a set of anonymised data, and attack it with the usual methods.

I used to be like this as well a long time ago. I remember in my very first job I had pissed off some clients by claiming that “I don’t care if this is a nut or a screw. As far as I’m concerned this is just a part number”.

Over time, though, I’ve come to realise that even a little bit of domain knowledge or intuition can help build significantly superior models. To use a framework I had introduced a few months back, your domain knowledge can be used to restrict the degrees of freedom in your model, thus increasing how much the machine can learn with the available data.

Then again, some problems lend themselves better to domain-based intuition than others, and this has to do with the meaning of a data point.

Consider two fairly popular problem statements from data science – determining whether a borrower will pay back a loan, and determining whether there is a cat in a given picture. While at the surface level, both are binary decisions, to be made by looking at large dimensional data (the number of data points that can be used for credit scoring can be immense), there is an important distinction between the two problems.

In the cat picture case, a single data point is basically the colour of a single pixel in an image, and it doesn’t really mean anything. If we were to try and build a cat recognition algorithm based on a single pre-chosen pixel in an image, it is unlikely we can do better than noise. Instead, the information is encoded in groups of pixels near each other – a bunch of pixels that look like cat ears, for example. In this case, whether you are training to model to identify cats or cinnamon buns is immaterial, and the domain-free approach works well.

With the credit scoring problem, the amount of information in each explanatory variable is significant. Unless we are looking at some extremely esoteric or insignificant variables (trust me, these get used fairly often in credit scoring models), it is possible to build a decision model based on just one explanatory variable and still have significant predictive power. There is definitely information in correlation between explanatory variables, but that pales compared to the information in the variables themselves.

And the amount of information captured by each explanatory variable means that it makes sense in these cases to invest some human effort to understand the variables and the impact it is having. In some cases, you might decide to use a mathematical transformation of a variable (square or log or inverse) instead of the variable itself. In other cases, you might determine based on logic that some correlations are spurious and drop the variables altogether. You might see a few explanatory variables with largely similar information and decide to drop some of them or use dimension reduction algorithms. And you can do a much better job of this if you have some experience or intuition about the domain, and care to understand what each variable means. Because variables have meanings.

Unlike in the image recognition problem, where most of the intuition is in the correlation term, because of which the “variables” don’t have any meaning, where domain doesn’t matter that much (though it can – in that some kinds of algorithms are superior at some kinds of images. I don’t have much experience in this domain to comment 🙂 ).

Again like in all the two-by-twos that I produce (and there are many, though this is arguably the most famous one), the problem is where you take people from one side and put them in a situation from the other side.

If you come from a background where you’ve mostly dealt with datasets where each individual variable is meaningless, but there is information in the collective, you are likely to “stir the pile” rather than using intuition to build better models.

If you are used to dealing with datasets with “meaning”, where variables hold the information, you might waste time doing your jiggery-pokery when you should be looking to apply models that get information in the collective.

The problem is this is a rather esoteric classification, so there is plenty of chance for people to be thrown into the wrong end.

Statistics and machine learning

So a group of statisticians (from Cyprus and Greece) have written an easy-to-read paper comparing statistical and machine learning methods in time series forecasting, and found that statistical methods do better, both in terms of accuracy and computational complexity.

To me, there’s no surprise in the conclusion, since in the statistical methods, there is some human intelligence involved, in terms of removing seasonality, making the time series stationary and then using statistical methods that have been built specifically for time series forecasting (including some incredibly simple stuff like exponential smoothing).

Machine learning methods, on the other hand, are more general purpose – the same neural networks used for forecasting these time series, with changed parameters, can be used for predicting something else.

In a way, using machine learning for time series forecasting is like using that little screwdriver from a Swiss army knife, rather than a proper screwdriver. Yes, it might do the job, but it’s in general inefficient and not an effective use of resources.

Yet, it is important that this paper has been written since the trend in industry nowadays has been that given cheap computing power, machine learning be used for pretty much any problem, irrespective of whether it is the most appropriate method for doing so. You also see the rise of “machine learning purists” who insist that no human intelligence should “contaminate” these models, and machines should do everything.

By pointing out that statistical techniques are superior at time series forecasting compared to general machine learning techniques, the authors bring to attention that using purpose-built techniques can actually do much better, and that we can build better systems by using a combination of human and machine intelligence.

They also helpfully include this nice picture that summarises what machine learning is good for, and I wholeheartedly agree: 

The paper also has some other gems. A few samples here:

Knowing that a certain sophisticated method is not as accurate as a much simpler one is upsetting from a scientific point of view as the former requires a great deal of academic expertise and ample computer time to be applied.

 

[…] the post-sample predictions of simple statistical methods were found to be at least as accurate as the sophisticated ones. This finding was furiously objected to by theoretical statisticians [76], who claimed that a simple method being a special case of e.g. ARIMA models, could not be more accurate than the ARIMA one, refusing to accept the empirical evidence proving the opposite.

 

A problem with the academic ML forecasting literature is that the majority of published studies provide forecasts and claim satisfactory accuracies without comparing them with simple statistical methods or even naive benchmarks. Doing so raises expectations that ML methods provide accurate predictions, but without any empirical proof that this is the case.

 

At present, the issue of uncertainty has not been included in the research agenda of the ML field, leaving a huge vacuum that must be filled as estimating the uncertainty in future predictions is as important as the forecasts themselves.

Beer and diapers: Netflix edition

When we started using Netflix last May, we created three personas for the three of us in the family – “Karthik”, “Priyanka” and “Berry”. At that time we didn’t realise that there was already a pre-created “kids” (subsequently renamed “children” – don’t know why that happened) persona there.

So while Priyanka and I mostly use our respective personas to consume Netflix (our interests in terms of video content hardly intersect), Berry uses both her profile and the kids profile for her stuff (of course, she’s too young to put it on herself. We do it for her). So over the year, the “Berry” profile has been mostly used to play Peppa Pig, and the occasional wildlife documentary.

Which is why we were shocked the other day to find that “Real life wife swap” had been recommended on her account. Yes, you read that right. We muttered a word of abuse about Netflix’s machine learning algorithms and since then have only used the “kids” profile to play Berry’s stuff.

Since then I’ve been wondering what made Netflix recommend “real life wife swap” to Berry. Surely, it would have been clear to Netflix that while it wasn’t officially classified as one, the Berry persona was a kid’s account? And even if it didn’t, didn’t the fact that the account was used for watching kids’ stuff lead the collaborative filtering algorithms at Netflix to recommend more kids’ stuff? I’ve come up with various hypotheses.

Since I’m not Netflix, and I don’t have their data, I can’t test it, but my favourite hypothesis so far involves what is possibly the most commonly cited example in retail analytics – “beer and diapers“. In this most-likely-apocryphal story, a supermarket chain discovered that beer and diapers were highly likely to appear together in shopping baskets. Correlation led to causation and a hypothesis was made that this was the result of tired fathers buying beer on their diaper shopping trips.

So the Netflix version of beer-and-diapers, which is my hypothesis, goes like this. Harrowed parents are pestered by their kids to play Peppa Pig and other kiddie stuff. The parents are so stressed that they don’t switch to the kid’s persona, and instead play Peppa Pig or whatever from their own accounts. The kid is happy and soon goes to bed. And then the parent decides to unwind by watching some raunchy stuff like “real life wife swap”.

Repeat this story in enough families, and you have a strong enough pattern that accounts not explicitly classified as “kids/children” have strong activity of both kiddie stuff and adult content. And when you use an account not explicitly mentioned as “kids” to watch kiddie stuff, it gets matched to these accounts that have created the pattern – Netflix effectively assumes that watching kid stuff on an adult account indicates that the same account is used to watch adult content as well. And so serves it to Berry!

Machine learning algorithms basically work on identifying patterns in data, and then fitting these patterns on hitherto unseen data. Sometimes the patterns make sense – like Google Photos identifying you even in your kiddie pics. Other times, the patterns are offensive – like the time Google Photos classified a black woman as a “gorilla“.

Thus what is necessary is some level of human oversight, to make sure that the patterns the machine has identified makes some sort of sense (machine learning purists say this is against the spirit of machine learning, since one of the purposes of machine learning is to discover patterns not perceptible to humans).

That kind of oversight at Netflix would have suggested that you can’t tag a profile to a “kiddie content AND adult content” category if the profile has been used to watch ONLY kiddie content (or ONLY adult content). And that kind of oversight would have also led Netflix to investigate issues of users using “general” account for their kids, and coming up with an algorithm to classify such accounts as kids’ accounts, and serve only kids’ content there.

It seems, though, that algorithms run supreme at Netflix, and so my baby daughter gets served “real life wife swap”. Again, this is all a hypothesis (real life wife swap being recommended is a fact, of course)!

Stirring the pile efficiently

Warning: This is a technical post, and involves some code, etc. 

As I’ve ranted a fair bit on this blog over the last year, a lot of “machine learning” in the industry can be described as “stirring the pile”. Regular readers of this blog will be familiar with this image from XKCD by now:

Source: https://xkcd.com/1838/

Basically people simply take datasets and apply all the machine learning techniques they have heard of (implementation is damn easy – scikit learn allows you to implement just about any model in three similar looking lines of code; See my code here to see how similar the implementation is).

So I thought I’ll help these pile-stirrers by giving some hints of what method to use for different kinds of data. I’ve over-simplified stuff, and so assume that:

  1. There are two predictor variables X and Y. The predicted variable “Z” is binary.
  2. X and Y are each drawn from a standard normal distribution.
  3. The predicted variable Z is “clean” – there is a region in the X-Y plane where Z is always “true” and another region where Z is always “false”
  4. So the idea is to see which machine learning techniques are good at identifying which kind of geometrical figures.
  5. Everything is done “in-sample”. Given the nature of the data, it doesn’t matter if we do it in-sample or out-of-sample.

For those that understand Python (and every pile-stirrer worth his salt is excellent at Python), I’ve put my code in a nice Jupyter Notebook, which can be found here.

So this is what the output looks like. The top row shows the “true values” of Z. Then we have a row for each of the techniques we’ve used, which shows how well these techniques can identify the pattern given in the top row (click on the image for full size).

As you can see, I’ve chosen some common geometrical shapes and seen which methods are good at identifying those. A few pertinent observations:

  1. Logistic regression and linear SVM are broadly similar, and both are shit for this kind of dataset. Being linear models, they fail to deal with non-linear patterns
  2. SVM with RBF kernel is better, but it fails when there are multiple “true regions” in the dataset. At least it’s good at figuring out some non-linear patterns. However, it can’t figure out the triangle or square – it draws curves around them, instead.
  3. Naive Bayesian (I’ve never understood this even though I’m pretty good at Bayesian statistics, but I understand this is a commonly used technique; and I’ve used default parameters so not sure how it is “Bayesian” even) can identify some stuff but does badly when there are disjoint regions where Z is true.
  4. Ensemble methods such as Random Forests and Gradient Boosting do rather well on all the given inputs. They do well for both polygons and curves. Elsewhere, Ada Boost mostly does well but trips up on the hyperbola.
  5. For some reason, Lasso fails to give an output (in the true spirit of pile-stirring, I didn’t explore why). Ridge is again a regression method and so does badly on this non-linear dataset
  6. Neural Networks (Multi Layer Perceptron to be precise) does reasonably well, but can’t figure out the sharp edges of the polygons.
  7. Decision trees again do rather well. I’m pleasantly surprised that they pick up and classify the disjoint sets (multi-circle and hyperbola) correctly. Maybe it’s the way scikit learn implements them?

Of course, the datasets that one comes across in real life are never such simple geometrical figures, but I hope that this set can give you some idea on what techniques to use where.

At least I hope that this makes you think about the suitability of different techniques for the data rather than simply applying all the techniques you know and then picking the one that performs best on your given training and test data.

That would count as nothing different from p-hacking, and there’s an XKCD for that as well!

Source: https://xkcd.com/882/

Astrology and Data Science

The discussion goes back some 6 years, when I’d first started setting up my data and management consultancy practice. Since I’d freshly quit my job to set up the said practice, I had plenty of time on my hands, and the wife suggested that I spend some of that time learning astrology.

Considering that I’ve never been remotely religious or superstitious, I found this suggestion preposterous (I had a funny upbringing in the matter of religion – my mother was insanely religious (including following a certain Baba), and my father was insanely rationalist, and I kept getting pulled in both directions).

Now, the wife has some (indirect) background in astrology. One of her aunts is an astrologer, and specialises in something called “prashNa shaastra“, where the prediction is made based on the time at which the client asks the astrologer a question. My wife believes this has resulted in largely correct predictions (though I suspect a strong dose of confirmation bias there), and (very strangely to me) seems to believe in the stuff.

“What’s the use of studying astrology if I don’t believe in it one bit”, I asked. “Astrology is very mathematical, and you are very good at mathematics. So you’ll enjoy it a lot”, she countered, sidestepping the question.

We went off into a long discussion on the origins of astrology, and how it resulted in early developments in astronomy (necessary in order to precisely determine the position of planets), and so on. The discussion got involved, and involved many digressions, as discussions of this sort might entail. And as you might expect with such discussions, my wife threw a curveball, “You know, you say you’re building a business based on data analysis. Isn’t data analysis just like astrology?”

I was stumped (ok I know I’m mixing metaphors here), and that had ended the discussion then.

Until I decided to bring it up recently. As it turns out, once again (after a brief hiatus when I decided I’ll do a job) I’m in process of setting up a data and management consulting business. The difference is this time I’m in London, and that “data science” is a thing (it wasn’t in 2011). And over the last year or so I’ve been kinda disappointed to see what goes on in the name of “data science” around me.

This XKCD cartoon (which I’ve shared here several times) encapsulates it very well. People literally “pour data into a machine learning system” and then “stir the pile” hoping for the results.

Source: https://xkcd.com/1838/

In the process of applying fairly complex “machine learning” algorithms, I’ve seen people not really bother about whether the analysis makes intuitive sense, or if there is “physical meaning” in what the analysis says, or if the correlations actually determine causation. It’s blind application of “run the data through a bunch of scikit learn models and accept the output”.

And this is exactly how astrology works. There are a bunch of predictor variables (position of different “planets” in various parts of the “sky”). There is the observed variable (whether some disaster happened or not, basically), which is nicely in binary format. And then some of our ancients did some data analysis on this, trying to identify combinations of predictors that predicted the output (unfortunately they didn’t have the power of statistics or computers, so in that sense the models were limited). And then they simply accepted the outputs, without challenging why it makes sense that the position of Jupiter at the time of wedding affects how your marriage will go.

So I brought up the topic of astrology and data science again recently, saying “OK after careful analysis I admit that astrology is the oldest form of data science”. “That’s not what I said”, the wife countered. “I said that data science is new age astrology, and not the other way round”.

It’s hard to argue with that!

Machine learning and degrees of freedom

For starters, machine learning is not magic. It might appear like magic when you see Google Photos automatically tagging all your family members correctly, down to the day of their birth. It might appear so when Siri or Alexa give a perfect response to your request. And the way AlphaZero plays chess is almost human!

But no, machine learning is not magic. I’d made a detailed argument about that in the second edition of my newsletter (subscribe if you haven’t already!).

One way to think of it is that the output of a machine learning model (which could be anything from “does this picture contain a cat?” to “is the speaker speaking in English?”) is the result of a mathematical formula, whose parameters are unknown at the beginning of the exercise.

As the system gets “trained” (of late I’ve avoided using the word “training” in the context of machine learning, preferring to use “calibration” instead. But anyway…), the hitherto unknown parameters of the formula get adjusted in a manner that the formula output matches the given data. Once the system has “seen” enough data, we have a model, which can then be applied on unknown data (I’m completely simplifying it here).

The genius in machine learning comes in setting up mathematical formulae in a way that given input-output pairs of data can be used to adjust the parameters of the formulae. The genius in deep learning, which has been the rage this decade, for example, comes from a 30-year old mathematical breakthrough called “back propagation”. The reason it took until a few years back for it to become a “thing” has to do with data availability, and compute power (check this terrific piece in the MIT Tech Review about deep learning).

Within machine learning, the degree of complexity of a model can vary significantly. In an ordinary univariate least squares regression, for example, there are only two parameters the system can play with (slope and intercept of the regression line). Even a simple “shallow” neural network, on the other hand, has thousands of parameters.

Because a regression has so few parameters, the kind of patterns that the system can detect is rather limited (whatever you do, the system can only draw a line. Nothing more!). Thus, regression is applied only when you know that the relationship that exists is simple (and linear), or when you are trying to force-fit a linear model.

The upside of simple models such as regression is that because there are so few parameters to be adjusted, you need relatively few data points in order to adjust them to the required degree of accuracy.

As models get more and more complicated, the number of parameters increases, thus increasing the complexity of patterns that can be detected by the system. Close to one extreme, you have systems that see lots of current pictures of you and then identify you in your baby pictures.

Such complicated patterns can be identified because the system parameters have lots of degrees of freedom. The downside, of course, is that because the parameters start off having so much freedom, it takes that much more data to “tie them down”. The reason Google Photos can tag you in your baby pictures is partly down to the quantum of image data that Google has, which does an effective job of tying down the parameters. Google Translate similarly uses large repositories of multi-lingual text in order to “learn languages”.

Like most other things in life, machine learning also involves a tradeoff. It is possible for systems to identify complex patterns, but for that you need to start off with lots of “degrees of freedom”, and then use lots of data to tie down the variables. If your data is small, then you can only afford a small number of parameters, and that limits the complexity of patterns that can be detected.

One way around this, of course, is to use your own human intelligence as a pre-processing step in order to set up parameters in a way that they can be effectively tuned by data. Gopi had a nice post recently on “neat learning versus deep learning“, which is relevant in this context.

Finally, there is the issue of spurious correlations. Because machine learning systems are basically mathematical formulae designed to learn patterns from data, spurious correlations in the input dataset can lead to the system learning random things, which can hamper its predictive power.

Data sets, especially ones that have lots of dimensions, can display correlations that appear at random, but if the input dataset shows enough of these correlations, the system will “learn” them as a pattern, and try to use them in predictions. And the more complicated your model gets, the harder it is to know what it is doing, and thus the harder it is to identify these spurious correlations!

And the thing with having too many “free parameters” (lots of degrees of freedom but without enough data to tie down the parameters) is that these free parameters are especially susceptible to learning the spurious correlations – for they have no other job.

Thinking about it, after all, machine learning systems are not human!