classification – Pertinent Observations

Mo Salah and Machine Learning

First of all, I’m damn happy that Mo Salah has renewed his Liverpool contract. With Sadio Mane also leaving, the attack was looking a bit thin (I was distinctly unhappy with the Jota-Mane-Diaz forward line we used in the Champions League final. Lacked cohesion). Nunez is still untested in terms of “leadership”, and without Salah that would’ve left Firmino as the only “attacking leader”.

(non-technical readers can skip the section in italics and still make sense of this post)

Now that this is out of the way, I’m interested in seeing one statistic (for which I’m pretty sure I don’t have the data). For each of the chances that Salah has created, I want to look at the xG (expected goals) and whether he scored or not. And then look at a density plot of xG for both categories (scored or not).

For most players, this is likely to result in two very distinct curves – they are likely to score from a large % of high xG chances, and almost not score at all from low xG chances. For Salah, though, the two density curves are likely to be a lot closer.

What I’m saying is – most strikers score well from easy chances, and fail to score from difficult chances. Salah is not like that. On the one hand, he creates and scores some extraordinary goals out of nothing (low xG). On the other, he tends to miss a lot of seemingly easy chances (high xG).

In fact, it is quite possible to look at a player like Salah, see a few sitters that he has missed (he misses quite a few of them), and think he is a poor forward. And if you look at a small sample of data (or short periods of time) you are likely to come to the same conclusion. Look at the last 3-4 months of the 2021-22 season. The consensus among pundits then was that Salah had become poor (and on Reddit, you could see Liverpool fans arguing that we shouldn’t give him a lucrative contract extension since ‘he has lost it’).

It is well possible that this is exactly the conclusion Jose Mourinho came to back in 2013-14 when he managed Salah at Chelsea (and gave him very few opportunities). The thing with a player like Salah is that he is so unpredictable that it is very possible to see samples and think he is useless.

Of late, I’ve been doing (rather, supervising (and there is no pun intended) ) a lot of machine learning work. A lot of this has to do with binary classification – classifying something as either a 0 or a 1. Data scientists build models, which give out a probability score that the thing is a 1, and then use some (sometimes arbitrary) cutoff to determine whether the thing is a 0 or a 1.

There are a bunch of metrics in data science on how good a model is, and it all comes down to what the model predicted and what “really” happened. And I’ve seen data scientists work super hard to improve on these accuracy measures. What can be done to predict a little bit better? Why is this model only giving me 77% ROC-AUC when for the other problem I was able to get 90%?

The thing is – if the variable you are trying to predict is something like whether Salah will score from a particular chance, your accuracy metric will be really low indeed. Because he is fundamentally unpredictable. It is the same with some of the machine learning stuff – a lot of models are trying to predict something that is fundamentally unpredictable, so there is a limit on how accurate the model will get.

The problem is that you would have come across several problem statements that are much more predictable that you think it is a problem with you (or your model) that you can’t predict better. Pundits (or Jose) would have seen so many strikers who predictably score from good chances that they think Salah is not good.

The solution in these cases is to look at aggregates. Looking for each single prediction will not take us anywhere. Instead, can we predict over a large set of data whether we broadly got it right? In my “research” for this blogpost, I found this.

Last season, on average, Salah scored precisely as many goals as the model would’ve predicted! You might remember stunners like the one against Manchester City at Anfield. So you know where things got averaged out.

Meaningful and meaningless variables (and correlations)

A number of data scientists I know like to go about their business in a domain-free manner. They make a conscious choice to not know anything about the domain in which they are solving the problem, and instead treat a dataset as just a set of anonymised data, and attack it with the usual methods.

I used to be like this as well a long time ago. I remember in my very first job I had pissed off some clients by claiming that “I don’t care if this is a nut or a screw. As far as I’m concerned this is just a part number”.

Over time, though, I’ve come to realise that even a little bit of domain knowledge or intuition can help build significantly superior models. To use a framework I had introduced a few months back, your domain knowledge can be used to restrict the degrees of freedom in your model, thus increasing how much the machine can learn with the available data.

Then again, some problems lend themselves better to domain-based intuition than others, and this has to do with the meaning of a data point.

Consider two fairly popular problem statements from data science – determining whether a borrower will pay back a loan, and determining whether there is a cat in a given picture. While at the surface level, both are binary decisions, to be made by looking at large dimensional data (the number of data points that can be used for credit scoring can be immense), there is an important distinction between the two problems.

In the cat picture case, a single data point is basically the colour of a single pixel in an image, and it doesn’t really mean anything. If we were to try and build a cat recognition algorithm based on a single pre-chosen pixel in an image, it is unlikely we can do better than noise. Instead, the information is encoded in groups of pixels near each other – a bunch of pixels that look like cat ears, for example. In this case, whether you are training to model to identify cats or cinnamon buns is immaterial, and the domain-free approach works well.

With the credit scoring problem, the amount of information in each explanatory variable is significant. Unless we are looking at some extremely esoteric or insignificant variables (trust me, these get used fairly often in credit scoring models), it is possible to build a decision model based on just one explanatory variable and still have significant predictive power. There is definitely information in correlation between explanatory variables, but that pales compared to the information in the variables themselves.

And the amount of information captured by each explanatory variable means that it makes sense in these cases to invest some human effort to understand the variables and the impact it is having. In some cases, you might decide to use a mathematical transformation of a variable (square or log or inverse) instead of the variable itself. In other cases, you might determine based on logic that some correlations are spurious and drop the variables altogether. You might see a few explanatory variables with largely similar information and decide to drop some of them or use dimension reduction algorithms. And you can do a much better job of this if you have some experience or intuition about the domain, and care to understand what each variable means. Because variables have meanings.

Unlike in the image recognition problem, where most of the intuition is in the correlation term, because of which the “variables” don’t have any meaning, where domain doesn’t matter that much (though it can – in that some kinds of algorithms are superior at some kinds of images. I don’t have much experience in this domain to comment 🙂 ).

Again like in all the two-by-twos that I produce (and there are many, though this is arguably the most famous one), the problem is where you take people from one side and put them in a situation from the other side.

If you come from a background where you’ve mostly dealt with datasets where each individual variable is meaningless, but there is information in the collective, you are likely to “stir the pile” rather than using intuition to build better models.

If you are used to dealing with datasets with “meaning”, where variables hold the information, you might waste time doing your jiggery-pokery when you should be looking to apply models that get information in the collective.

The problem is this is a rather esoteric classification, so there is plenty of chance for people to be thrown into the wrong end.

Introverts and extroverts

I find the classification of people into introverts and extroverts to be rather simplistic. While it is bad enough that people are commonly classified into one of these, you also have metrics such as the Myers Briggs Type Indicator (MBTI) that formalise this classification, with top consulting firms actively using such classifications in their day-to-day work.

What makes introvert-extrovert thing complex is that it is not even a spectrum between introversion and extroversion – you can’t say, for example, that you’re “20% introvert and 80% extrovert”. So you can’t even convert the binary classification into a scale.

The thing is that introversion and extroversion is context sensitive. For example, I like to socialise by talking to people (I HATE “catching up” in cinema halls or loud bars, since they don’t allow conversation). In terms of work, though, I largely prefer to be left alone. Even within that, I sometimes like to talk to people when I’m ideating but wholly want to be left alone when I’m executing on something.

And with each person, there might be different contexts in which they might derive energy from people around them, and contexts where they might want to be left alone. And within each context, whether they want to be with or without people is probabilistic, without a good classifier telling when they want to be how.

So introversion or extroversion is a rather large and complex set of personality traits that people have tried to force-fit not only on one axis, but also into binary classifications. And with it being part of management theory as practiced by top strategy consulting firms, it’s simply sad.