Bankers predicting football

So the Football World Cup season is upon us, and this means that investment banking analysts are again engaging in the pointless exercise of trying to predict who will win the World Cup. And the funny thing this time is that thanks to MiFID II regulations, which prevent banking analysts from giving out reports for free, these reports aren’t in the public domain.

That means we have to rely on media reports of these reports, or on people tweeting insights from them. For example, the New York Times has summarised the banks’ predictions on the winner. And this scatter plot from Goldman Sachs will go straight into my next presentation on spurious correlations:

Different banks have taken different approaches to predicting who will win the tournament. UBS has stuck with a classic Monte Carlo simulation approach, but Goldman Sachs has gone one step further and used “four different methods in artificial intelligence” to predict (for the third consecutive time) that Brazil will win the tournament.

In fact, Goldman also uses a Monte Carlo simulation, as Business Insider reports.

The firm used machine learning to run 200,000 models, mining data on team and individual player attributes, to help forecast specific match scores. Goldman then simulated 1 million possible variations of the tournament in order to calculate the probability of advancement for each squad.

But an insider in Goldman with access to the report tells me that they don’t use the phrase “Monte Carlo” itself in the report. Maybe it’s a sign that “data scientists” have taken over the investment research division at the expense of quants.
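For readers who haven’t seen one, here is a minimal sketch of what a Monte Carlo tournament simulation looks like. It is not Goldman’s or UBS’s model: the team ratings, the “win probability proportional to rating” rule and the eight-team knockout format are all assumptions made up for illustration.

```python
import random
from collections import Counter

# Hypothetical strength ratings, purely illustrative (not from any bank's report).
RATINGS = {"Brazil": 2.0, "Germany": 1.9, "Spain": 1.8, "France": 1.8,
           "Argentina": 1.6, "Belgium": 1.6, "England": 1.4, "Portugal": 1.4}

def play_match(a, b, rng):
    """Pick a winner with probability proportional to rating (assumed model)."""
    p_a = RATINGS[a] / (RATINGS[a] + RATINGS[b])
    return a if rng.random() < p_a else b

def simulate_knockout(teams, rng):
    """Play single-elimination rounds until one team is left."""
    teams = list(teams)
    rng.shuffle(teams)  # random draw for the bracket
    while len(teams) > 1:
        teams = [play_match(teams[i], teams[i + 1], rng)
                 for i in range(0, len(teams), 2)]
    return teams[0]

def win_probabilities(n_sims=20_000, seed=42):
    """Estimate each team's chance of winning by repeated simulation."""
    rng = random.Random(seed)
    wins = Counter(simulate_knockout(RATINGS, rng) for _ in range(n_sims))
    return {team: wins[team] / n_sims for team in RATINGS}

if __name__ == "__main__":
    for team, p in sorted(win_probabilities().items(), key=lambda kv: -kv[1]):
        print(f"{team:10s} {p:.1%}")
```

Note that the output of such a simulation is a probability for every team, not a single “winner”.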

I’m also surprised by the reporting on Goldman’s predictions. Everyone simply reports that “Goldman predicts that Brazil will win”, but surely (based on the model they’ve used), that prediction has been made with a certain probability? A better way of reporting would’ve been to say “Goldman predicts Brazil most likely to win, with X% probability” (and the bank’s bets desk in the UK could have placed some money on it).

ING kept its forecast rather simple – it took players’ transfer values, summed them up by team, and concluded that Spain is most likely to win because their squad is the “most valued”. Now, I have two major questions about this approach. Firstly, it ignores the “correlation term” (remember the famous England conundrum of the noughties of fitting Gerrard and Lampard into the same eleven?), and assumes a set of strong players is a strong team. Secondly, have they accounted for inflation? And if so, how have they accounted for inflation? Player valuation (about which I have a chapter in my book) has simply gone through the roof in the last year, with Mo Salah at £35 million being considered a “bargain buy”.

Nomura also seems to have taken a similar approach, though they have in some ways accounted for the correlation term by including “team momentum” as a factor!

Anyway, I look forward to the football! That it is live on BBC and ITV means I get to watch the tournament from the comfort of my home (a luxury in England!). Also being in England means all matches are at a sane time, so I can watch more of this World Cup than the last one.


The science of shirt numbers

Yesterday, Michael Cox, author of the Zonal Marking blog and The Mixer, tweeted:

Now, there is some science to how football shirts are numbered. I had touched upon it in a very similar post I had written four years ago. You can also read this account on how players are numbered. And if you’re more curious about formations and their history, I recommend you read Jonathan Wilson’s Inverting the Pyramid.

To put it simply, number 1 is reserved for goalkeepers. Numbers 2 to 6 are for defenders, though some countries use either 4, 5 or 6 for midfielders. 7-11 are usually reserved for attacking midfielders and forwards, with 9 being the “centre forward” and 10 being the “second forward”.

Some of these numbers are so institutionalised that the number is sometimes enough to describe a player’s position and style. This has even led to jargon such as a “False Nine” (a midfielder playing furthest forward) or a “False Ten” (a striker playing in a withdrawn role).

There is less science to the allocation of shirt numbers 12 to 23, since these are not starting positions. One rule of thumb is to allocate these numbers for the backups for the corresponding positions. So 12 is the reserve goalie, 13 is the reserve right back and so on (with 23 for the squad’s third goalkeeper).

So how have teams chosen to number their squads in the FIFA World Cup that starts next week? This picture summarises the distribution of position by number: 


There is no surprise in Number 1, which all teams have allocated to their goalkeeper, and numbers 2 and 3 are mostly allocated to defenders as well (there are some exceptions there, with Iran’s Mehdi Torabi and Denmark’s Michael Krohn-Dehli wearing Number 2 even though they are midfielders, and Iceland midfielder Samuel Friojonsson wearing 3).

That different countries use 4, 5 or 6 for midfielders is illustrated in the data, though two forwards (Australian legend Tim Cahill and Croatia’s Ivan Perisic) puzzlingly wear 4 (it’s less puzzling in Cahill’s case since he started as a central midfielder and slowly moved forward).

7 is the right winger’s number, and depending upon that position’s interpretation, its wearer can be either a midfielder or a forward. 8 is primarily a midfielder’s number, while 9 is (obviously) a striker’s number. Interestingly, five midfielders will wear the Number 9 shirt (the most prominent being Russia’s Alan Dzagoev). 10 and 11 are evenly split between midfielders and forwards, though two defenders (Serbia’s Aleksandar Kolarov and Tunisia’s Dylan Bronn) also wear 11.

Beyond 11, there isn’t that much of a science, but one thing that is clear is that Cox got it wrong – for it isn’t so “textbook” to give 12 to the reserve right back. As we can see from the data, 20 teams have used that number for their reserve goalies!

It’s like England has put their squad numbers into a little bit of a Mixer!

Chasing Dhoni

Former India captain Mahendra Singh Dhoni has a mixed record when it comes to chasing in limited overs games (ODIs and T20s). He initially built up his reputation as an expert chaser, who knew exactly how to pace an innings and accelerate at the right moment to deliver victory.

Of late, though, his chasing has been going wrong, the latest example being Chennai Super Kings’ loss at Kings XI Punjab over the weekend. Dhoni no doubt played excellently – 79 off 44 is a brilliant innings in most contexts. Where he possibly fell short was in the way he paced the innings.

And the algorithm I’ve built to represent (and potentially evaluate) a cricket match seems to have done a remarkable job in identifying this problem in the KXIP-CSK game. Now, apart from displaying how the game “flowed” from start to finish, the algorithm is also designed to pick out key moments or periods in the game.

One kind of “key period” that the algorithm tries to pick is a batsman’s innings – periods of play where a batsman made a significant contribution (either positive or negative) to his team’s chances of winning. And notice how nicely it has identified two distinct periods in Dhoni’s batting:

The first period is one where Dhoni settled down, and batted rather slowly – he hit only 21 runs in 22 balls in that period, which is incredibly slow for a 10 runs per over game. Notice how this period of Dhoni’s batting coincides with a period when the game decisively swung KXIP’s way.

And then Dhoni went for it, hitting 36 runs in 11 balls (which is great going even for a 10-runs-per-over game), including 19 off the penultimate over bowled by Andrew Tye. While this brought CSK back into the game (to right where the game stood prior to Dhoni’s slow period of batting), it was a little too late as KXIP managed to hold on.
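To put the two periods on the same scale as the “10 runs per over” asking rate, here is the arithmetic as a small sketch, using only the figures quoted above (the function name is mine):

```python
def runs_per_over(runs, balls):
    """Scoring rate in runs per six-ball over."""
    return runs * 6 / balls

slow_period = runs_per_over(21, 22)   # the period where Dhoni settled down
fast_period = runs_per_over(36, 11)   # the period where he went for it
whole_innings = runs_per_over(79, 44)

if __name__ == "__main__":
    print(f"slow period:   {slow_period:.2f} runs per over")
    print(f"fast period:   {fast_period:.2f} runs per over")
    print(f"whole innings: {whole_innings:.2f} runs per over")
```

So the slow period ran at well under 6 an over, and the late burst at nearly 20.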

Now I understand I’m making an argument using one data point here, but this problem with Dhoni, where he first slows down and then goes for it with only a few overs to go, has been discussed widely. What’s interesting is how neatly my algorithm has picked out these periods!

A banker’s apology

Whenever there is a massive stock market crash, like the one in 1987, or the crisis in 2008, it is common for investment banking quants to talk about how it was a “1 in a zillion years” event. This is on account of their models, which typically assume that stock prices are lognormal, and that stock price movement is Markovian (today’s movement is uncorrelated with tomorrow’s).

In fact, a cursory look at recent data shows that what models show to be a one in a zillion years event actually happens every few years, or decades. In other words, while quant models do pretty well in the average case, they have thin “tails” – they underestimate the likelihood of extreme events, which lets risk build up unnoticed in the system.
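To get a feel for the “zillion”, here is a rough sketch of the calculation. The assumptions are mine and deliberately crude: daily log returns are normal with a 1% daily volatility, and there are 252 trading days in a year. The one real number is the Dow’s fall of roughly 22.6% on 19 October 1987.

```python
import math

TRADING_DAYS_PER_YEAR = 252  # common convention, assumed here

def tail_probability(k):
    """P(Z <= -k) for a standard normal Z, i.e. a fall of k sigmas or worse."""
    return 0.5 * math.erfc(k / math.sqrt(2))

def years_between(k):
    """Expected years between daily falls of at least k sigmas under the model."""
    return 1 / (tail_probability(k) * TRADING_DAYS_PER_YEAR)

if __name__ == "__main__":
    daily_vol = 0.01                 # assumed 1% daily volatility of log returns
    crash = math.log(1 - 0.226)      # log return of a roughly 22.6% one-day fall
    k = abs(crash) / daily_vol       # size of the move in standard deviations
    print(f"{k:.1f} sigmas: once every {years_between(k):.3g} years under the model")
```

Even a mere five-sigma daily fall comes out as a once in several thousand years event under these assumptions, which is not what the historical record looks like.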

When I decided to end my (brief) career as an investment banking quant in 2011, I wanted to take the methods that I’d learnt into other industries. While “data science” might have become a thing in the intervening years, there is still a lot for conventional industry to learn from banking in terms of using maths for management decision-making. And this makes me believe I’m still in business.

And like my former colleagues in investment banking quant, I’m not immune to the fat tail problem either – replicating solutions from one domain into another can replicate the problems as well.

For a while now I’ve been building what I think is a fairly innovative way to represent a cricket match. Basically you look at how the balance of play shifts as the game goes along. So the representation is a line graph that shows where the balance of play was at different points of time in the game.

This way, you have a visualisation that at one shot tells you how the game “flowed”. Consider, for example, last night’s game between Mumbai Indians and Chennai Super Kings. This is what the game looks like in my representation.

What this shows is that Mumbai Indians got a small advantage midway through the innings (after a short blast by Ishan Kishan), which they held through their innings. The game was steady for about 5 overs of the CSK chase, when some tight overs created pressure that resulted in Suresh Raina getting out.

Soon, Ambati Rayudu and MS Dhoni followed him to the pavilion, and MI were in control, with CSK losing 6 wickets in the course of 10 overs. When they lost Mark Wood in the 17th Over, Mumbai Indians were almost surely winners – my system reckoning that 48 to win in 21 balls was near-impossible.

And then Bravo got into the act, putting on 39 in 10 balls with Imran Tahir watching at the other end (including taking 20 off a Mitchell McClenaghan over, and 20 again off a Jasprit Bumrah over at the end of which Bravo got out). And then a one-legged Jadhav came, hobbled for 3 balls and then finished off the game.

Now, while the shape of the curve in the above graph is representative of what happened in the game, I think it went too close to the axes. 48 off 21 with 2 wickets in hand is not easy, but it’s not a 1% probability event (as my graph depicts).

And looking into my model, I realise I’ve made the familiar banker’s mistake – of assuming independence and the Markovian property. I calculate the probability of a team winning using a method called “backward induction” (that I’d learnt during my time as an investment banking quant). It’s the same method used by WASP, the system for evaluating odds invented by a few Kiwi scientists, and as I’d pointed out in the past, WASP has the thin tails problem as well.
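Here is a toy sketch of backward induction for a chase, to show where the independence assumption sneaks in. This is not my actual model, nor WASP’s: the per-ball outcome probabilities are made-up numbers, and every ball is assumed to be an independent draw from the same distribution regardless of who is batting or what happened on the previous ball.

```python
from functools import lru_cache

# Hypothetical per-ball outcome probabilities (runs scored, or "W" for a wicket).
# Illustrative guesses only, not fitted to any data; they sum to 1.
OUTCOMES = {0: 0.35, 1: 0.35, 2: 0.07, 4: 0.12, 6: 0.06, "W": 0.05}

@lru_cache(maxsize=None)
def win_prob(balls_left, runs_needed, wickets_left):
    """Chasing side's win probability, computed back from the end of the game."""
    if runs_needed <= 0:
        return 1.0          # target reached
    if balls_left == 0 or wickets_left == 0:
        return 0.0          # out of balls or all out
    total = 0.0
    for outcome, p in OUTCOMES.items():
        if outcome == "W":
            total += p * win_prob(balls_left - 1, runs_needed, wickets_left - 1)
        else:
            total += p * win_prob(balls_left - 1, runs_needed - outcome, wickets_left)
    return total

if __name__ == "__main__":
    # The situation from the MI-CSK game: 48 to win in 21 balls, 2 wickets in hand.
    print(f"{win_prob(21, 48, 2):.2%}")
```

Because each ball is treated as independent, a batsman getting “hot” for two overs is just a freak run of lucky draws in this setup, which is why such a model pushes the curve so close to the axes.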

As Seamus Hogan, one of the inventors of WASP, had pointed out in a comment on that post, one way of solving this thin tails issue is to control for the pitch or regime, and I’ve incorporated that as well (using a Bayesian system to “learn” the nature of the pitch as the game goes on). Yet, I see I still struggle with fat tails.

I seriously need to find a way to take serial correlation into account in my models!
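One simple way to “learn” a pitch as the game goes on is a conjugate Gamma-Poisson update on the scoring rate. I should stress this is only an illustration of the idea and an assumption on my part for this example: the prior numbers are invented, and treating runs per ball as Poisson is a convenience rather than a claim about how cricket works.

```python
def update_pitch(alpha, beta, runs, balls):
    """Gamma(alpha, beta) prior on runs per ball, updated with observed play.

    With a Poisson likelihood the posterior is Gamma(alpha + runs, beta + balls).
    """
    return alpha + runs, beta + balls

def expected_runs_per_over(alpha, beta):
    """Posterior mean scoring rate, converted from per ball to per over."""
    return 6 * alpha / beta

if __name__ == "__main__":
    alpha, beta = 8.0, 6.0               # hypothetical prior: 8 runs per over
    print(expected_runs_per_over(alpha, beta))
    alpha, beta = update_pitch(alpha, beta, runs=30, balls=30)  # a slow first 5 overs
    print(round(expected_runs_per_over(alpha, beta), 2))
```

The weaker the prior (smaller alpha and beta for the same ratio), the faster the estimate moves towards what is being observed on the day.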

That said, I must say I’m fairly kicked about the system I’ve built. Do let me know what you think of this!

English Premier League: Goal Difference to points correlation

So I was just looking down the English Premier League Table for the season, and I found that as I went down the list, the goal difference went lower. There’s nothing counterintuitive in this, but the degree of correlation seemed eerie.

So I downloaded the data and drew a scatter plot. And what do you have? A near-perfect linear fit. I even ran the regression and found an R-squared of 96%.
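For anyone who wants to repeat this on a league table, here is a minimal sketch of the regression in plain Python. The numbers in `GOAL_DIFF` and `POINTS` are made up for illustration and are not the actual EPL table; swap in the real columns to reproduce the figure.

```python
def linear_fit(xs, ys):
    """Ordinary least squares of ys on xs; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical goal differences and points for eight teams (not real data).
GOAL_DIFF = [60, 30, 25, 10, 0, -10, -20, -35]
POINTS = [90, 75, 70, 55, 48, 42, 36, 30]

if __name__ == "__main__":
    slope, intercept, r2 = linear_fit(GOAL_DIFF, POINTS)
    print(f"points = {intercept:.1f} + {slope:.2f} * goal difference, R-squared = {r2:.1%}")
```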

In other words, this EPL season has simply been all about scoring lots of goals and not letting in too many goals. It’s almost like the distribution of the goals itself doesn’t matter – apart from the relegation battle, that is!

PS: Look at the extent of Manchester City’s lead at the top. And what a scrap the relegation is!

The Derick Parry management paradigm

Before you ask, Derick Parry was a West Indian cricketer. He finished his international playing career before I was born, partly because he bowled spin at a time when the West Indies usually played four fearsome fast bowlers, and partly because he went on rebel tours to South Africa.

That, however, doesn’t mean that I never watched him play – there was a “masters” series sometime in the mid 1990s when he played as part of the “West Indies Masters” team. I don’t even remember who they were playing, or where (such series aren’t archived well, so I can’t find the score card either).

All I remember is that Parry was batting along with Larry Gomes, and the West Indies Masters were chasing a modest target. Parry is relevant to our discussion because of the commentator’s (don’t remember who – it was an Indian guy) repeated descriptions of how he should play.

“Parry should not bother about runs”, the commentator kept saying. “He should simply use his long reach and smother the spin and hold one end up. It is Gomes who should do the scoring”. And incredibly, that’s how West Indies Masters got to the target.

So the Derick Parry management paradigm consists of eschewing all the “interesting” or “good” or “impactful” work (“scoring”, basically; no pun intended), and simply being focussed on holding one end up, or providing support. It wasn’t that Parry couldn’t score – he had a Test batting average of 22 – but on that day the commentator wanted him to simply hold one end up and let the more accomplished batsman do the scoring.

I’ve seen this happen at various levels, but this usually happens at the intra-company level. There will be one team which will explicitly not work on the more interesting part of the problem, and instead simply “provide support” to another team that works on this stuff. In a lot of cases it is not that the “supporting team” doesn’t have the ability or skills to execute the task end-to-end. It just so happens that they are a part of the organisation which is “not supposed to do the scoring”. Most often, this kind of a relationship is seen in companies with offshore units – the offshore unit sticks to providing support to the onshore unit, which does the “scoring”.

In some cases, the Derick Parry school goes to inter-company deals as well, and in such cases it is usually done so as to win the business. Basically if you are trying to win an outsourcing contract, you don’t want to be seen doing something that the client considers to be “core business”. And so even if you’re fully capable of doing that, you suppress that part of your offering and only provide support. The plan in some cases is to do a Mustafa’s camel, but in most cases that doesn’t succeed.

I’m not offering any comment on whether the Derick Parry strategy of management is good or not. All I’m doing here is to attach this oft-used strategy to a name, one that is mostly forgotten.

PM’s Eleven

The first time I ever heard of Davos was in 1997, when then Indian Prime Minister HD Deve Gowda attended the conference in the ski resort and gave a speech. He was heavily pilloried by the Kannada media, and given the moniker “Davos Gowda”.

Maybe because of all the attention Deve Gowda received for the trip, and not in a good way, no Indian Prime Minister ventured to go there for another twenty years. Until, of course, Narendra Modi went there earlier this week and gave a speech that apparently got widely appreciated in China.

There is another thing that connects Modi and Deve Gowda as Prime Ministers (leaving aside trivialities such as them being chief ministers of their respective states before becoming Prime Ministers).

Back in 1996 when Deve Gowda was Prime Minister, Rahul Dravid, Venkatesh Prasad and Sunil Joshi made their Test debuts (on the tour of England). Anil Kumble and Javagal Srinath had long been fixtures in the Indian cricket team. Later that year, Sujith Somasunder played a couple of one dayers. David Johnson played two Tests. And in early 1997, Doddanarasaiah Ganesh played a few Test matches.

In case you haven’t yet figured it out, all these cricketers came from Karnataka, the same state as the Prime Minister. During that season, it was normal for at least five players in the Indian Eleven to be from Karnataka. Since Deve Gowda had become Prime Minister around the same time, it was no surprise that the Indian cricket team was called “PM’s Eleven”. Coincidentally, the chairman of selectors at that point in time was Gundappa Vishwanath, who is also from Karnataka.

The Indian team playing in the current Test match in Johannesburg has four players from Gujarat. Now, this is not as noticeable as five players from Karnataka because Gujarat is home to three Ranji Trophy teams. Cheteshwar Pujara plays for Saurashtra, Parthiv Patel and Jasprit Bumrah play for Gujarat, and Hardik Pandya plays for Baroda. And Saurashtra’s Ravindra Jadeja is also part of the squad.

It had been a long time since one state had thus dominated the Indian cricket team. Perhaps we hadn’t seen this kind of domination since Karnataka had dominated in the late 1990s. And it so happens that once again the state dominating the Indian cricket team happens to be the Prime Minister’s home state.

So after a gap of twenty one years, we had an Indian Prime Minister addressing Davos. And after a gap of twenty one years, we have an Indian cricket team that can be called “PM’s Eleven”!

As Baada put it the other day, “Modi is the new Deve Gowda. Just without family and sleep”.

Update: I realised after posting that I have another post called “PM’s Eleven” on this blog. It was written in the UPA years.