Mo Salah and Machine Learning

First of all, I’m damn happy that Mo Salah has renewed his Liverpool contract. With Sadio Mane leaving, the attack was looking a bit thin (I was distinctly unhappy with the Jota-Mane-Diaz forward line we used in the Champions League final; it lacked cohesion). Nunez is still untested in terms of “leadership”, and without Salah that would’ve left Firmino as the only “attacking leader”.

(non-technical readers can skip the section in italics and still make sense of this post)

Now that this is out of the way, I’m interested in seeing one statistic (for which I’m pretty sure I don’t have the data). For each of the chances that Salah has created, I want to look at the xG (expected goals) and whether he scored or not. And then look at a density plot of xG for both categories (scored or not). 

For most players, this is likely to result in two very distinct curves – they are likely to score from a large % of high xG chances, and almost not score at all from low xG chances. For Salah, though, the two density curves are likely to be a lot closer.

What I’m saying is – most strikers score well from easy chances, and fail to score from difficult chances. Salah is not like that. On the one hand, he creates and scores some extraordinary goals out of nothing (low xG). On the other, he tends to miss a lot of seemingly easy chances (high xG).
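Since I don't have the real shot data, here is a rough sketch of the comparison on simulated chances. Everything in it is an assumption made for illustration: the two players are hypothetical, the xG values are drawn from a made-up distribution, and the 0.25 flat conversion rate is an arbitrary choice. Instead of drawing the two density curves, it uses a cruder proxy for how far apart they sit: the gap between the average xG of scored chances and the average xG of missed ones.

```python
import random
from statistics import mean

def simulate_shots(n, follows_xg, seed):
    """Return (xg, scored) pairs for a hypothetical player.

    follows_xg=1.0 means the player converts each chance exactly as xG says;
    lower values pull the conversion rate toward a flat 0.25 (assumed value).
    """
    rng = random.Random(seed)
    shots = []
    for _ in range(n):
        xg = rng.betavariate(1.2, 4.0)  # mostly low-xG chances, a few big ones
        p_score = follows_xg * xg + (1 - follows_xg) * 0.25
        shots.append((xg, rng.random() < p_score))
    return shots

def xg_gap(shots):
    """Mean xG of scored chances minus mean xG of missed chances."""
    scored = [xg for xg, s in shots if s]
    missed = [xg for xg, s in shots if not s]
    return mean(scored) - mean(missed)

typical = simulate_shots(5000, follows_xg=1.0, seed=1)   # "predictable" striker
streaky = simulate_shots(5000, follows_xg=0.3, seed=1)   # Salah-like, by assumption

print("typical gap:", round(xg_gap(typical), 3))
print("streaky gap:", round(xg_gap(streaky), 3))
```

A smaller gap means the "scored" and "not scored" xG distributions overlap more, which is the pattern I am guessing Salah's real data would show.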

In fact, it is quite possible to look at a player like Salah, see a few sitters that he has missed (he misses quite a few of them), and think he is a poor forward. And if you look at a small sample of data (or short periods of time) you are likely to come to the same conclusion. Look at the last 3-4 months of the 2021-22 season. The consensus among pundits then was that Salah had become poor (and on Reddit, you could see Liverpool fans arguing that we shouldn’t give him a lucrative contract extension since ‘he has lost it’).

It is quite possible that this is exactly the conclusion Jose Mourinho came to back in 2013-14 when he managed Salah at Chelsea (and gave him very few opportunities). The thing with a player like Salah is that he is so unpredictable that it is easy to look at a small sample and think he is useless.

Of late, I’ve been doing (rather, supervising; no pun intended) a lot of machine learning work. A lot of this has to do with binary classification – classifying something as either a 0 or a 1. Data scientists build models, which give out a probability score that the thing is a 1, and then use some (sometimes arbitrary) cutoff to determine whether the thing is a 0 or a 1.
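The cutoff step is simple enough to show in a few lines. This is only a minimal sketch: the function name `classify`, the scores, and the 0.5 default are all made up for illustration.

```python
def classify(probabilities, cutoff=0.5):
    """Turn model probability scores into hard 0/1 labels using a cutoff."""
    return [1 if p >= cutoff else 0 for p in probabilities]

scores = [0.05, 0.30, 0.55, 0.80]   # hypothetical model outputs

print(classify(scores))         # [0, 0, 1, 1]
print(classify(scores, 0.25))   # [0, 1, 1, 1]
```

Same scores, different cutoff, different labels. That is why the choice of cutoff matters as much as the model that produces the scores.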

There are a bunch of metrics in data science on how good a model is, and it all comes down to what the model predicted and what “really” happened. And I’ve seen data scientists work super hard to improve on these accuracy measures. What can be done to predict a little bit better? Why is this model only giving me 77% ROC-AUC when for the other problem I was able to get 90%?

The thing is – if the variable you are trying to predict is something like whether Salah will score from a particular chance, your accuracy metric will be really low indeed. Because he is fundamentally unpredictable. It is the same with some of the machine learning stuff – a lot of models are trying to predict something that is fundamentally unpredictable, so there is a limit on how accurate the model will get.
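To see that limit, here is a small simulation, again under loud assumptions: the data is synthetic, and the "model" is handed the true probability of every event, so it is as good as a model can possibly be. The only thing that changes between the two runs is how much the true probabilities vary. ROC-AUC is computed by brute force as the chance that a random positive gets a higher score than a random negative.

```python
import random

def roc_auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def simulate(n, low, high, seed):
    """Draw true probabilities uniformly in [low, high], then outcomes from them."""
    rng = random.Random(seed)
    probs = [rng.uniform(low, high) for _ in range(n)]
    labels = [1 if rng.random() < p else 0 for p in probs]
    return labels, probs

# Problem A: true probabilities range widely, so outcomes are fairly predictable.
labels_a, probs_a = simulate(2000, 0.0, 1.0, seed=42)
# Problem B: every event is close to a coin toss, so outcomes are mostly noise.
labels_b, probs_b = simulate(2000, 0.3, 0.7, seed=42)

auc_predictable = roc_auc(labels_a, probs_a)
auc_noisy = roc_auc(labels_b, probs_b)
print("perfect model, predictable problem:", round(auc_predictable, 3))
print("perfect model, noisy problem:      ", round(auc_noisy, 3))
```

Even with the true probabilities in hand, the score on the noisy problem comes out clearly lower. No amount of extra modelling effort closes that gap, because the gap is in the problem, not in the model.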

The problem is that you will have come across several problem statements that are so much more predictable that you think it is a problem with you (or your model) that you can’t predict better. Pundits (or Jose) would have seen so many strikers who predictably score from good chances that they think Salah is not good.

The solution in these cases is to look at aggregates. Looking at each single prediction will not take us anywhere. Instead, can we check, over a large set of data, whether we broadly got it right? In my “research” for this blogpost, I found this.

Last season, on average, Salah scored precisely as many goals as the model would’ve predicted! You might remember stunners like the one against Manchester City at Anfield. So you know where things got averaged out.
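The aggregate check itself is just a comparison of two sums. Here is a minimal sketch on made-up data: the xG values are hypothetical, and the outcomes are drawn from those same xG values, so the sketch assumes the model is right on average. It is not Salah's actual record.

```python
import random

rng = random.Random(7)

# Hypothetical per-chance xG for 300 chances (made-up numbers, not real data).
xg = [rng.uniform(0.02, 0.6) for _ in range(300)]
# Outcome of each chance, drawn from its xG (assumes xG is right on average).
scored = [1 if rng.random() < p else 0 for p in xg]

expected_goals = sum(xg)      # what the model predicts in total
actual_goals = sum(scored)    # what actually happened in total

print("expected goals:", round(expected_goals, 1))
print("actual goals:  ", actual_goals)
print("ratio:         ", round(actual_goals / expected_goals, 2))
```

Any single chance here is close to unpredictable, yet the two totals land near each other. That is the sense in which a model can be "right" about a player like Salah.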