Bankers predicting football

So the Football World Cup season is upon us, and this means that investment banking analysts are again engaging in the pointless exercise of trying to predict who will win the World Cup. And the funny thing this time is that thanks to MiFID II regulations, which prevent banking analysts from giving out reports for free, these reports aren’t in the public domain.

That means we’ve to rely on media reports of these reports, or on people tweeting insights from them. For example, the New York Times has summarised the banks’ predictions on the winner. And this scatter plot from Goldman Sachs will go straight into my next presentation on spurious correlations:

Different banks have taken different approaches to predicting who will win the tournament. UBS has still gone for a classic Monte Carlo simulation approach, but Goldman Sachs has gone one step ahead and used “four different methods in artificial intelligence” to predict (for the third consecutive time) that Brazil will win the tournament.

In fact, Goldman also uses a Monte Carlo simulation, as Business Insider reports.

The firm used machine learning to run 200,000 models, mining data on team and individual player attributes, to help forecast specific match scores. Goldman then simulated 1 million possible variations of the tournament in order to calculate the probability of advancement for each squad.

But an insider in Goldman with access to the report tells me that they don’t use the phrase “Monte Carlo” itself in the report. Maybe it’s a suggestion that “data scientists” have taken over the investment research division at the expense of quants.

I’m also surprised by the reporting on Goldman’s predictions. Everyone simply reports that “Goldman predicts that Brazil will win”, but surely (based on the model they’ve used), that prediction has been made with a certain probability? A better way of reporting would’ve been to say “Goldman predicts Brazil most likely to win, with X% probability” (and the bank’s bets desk in the UK could have placed some money on it).
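To see why such a forecast is really a probability, here is a minimal Monte Carlo sketch of a knockout bracket. Everything in it is an assumption for illustration: the team names, the strength numbers and the rule that a team wins a match with probability proportional to its strength are made up, and this is not the model any of the banks used. It also assumes the number of teams is a power of two.

```python
import random
from collections import Counter

def play_match(a, b, strengths, rng):
    """Return the winner; a wins with probability s_a / (s_a + s_b)."""
    p_a = strengths[a] / (strengths[a] + strengths[b])
    return a if rng.random() < p_a else b

def simulate_knockout(strengths, rng):
    """Play one single-elimination bracket, pairing teams in dictionary order."""
    teams = list(strengths)
    while len(teams) > 1:
        teams = [play_match(a, b, strengths, rng)
                 for a, b in zip(teams[::2], teams[1::2])]
    return teams[0]

def title_probabilities(strengths, runs=20000, seed=2018):
    """Estimate each team's chance of winning as its share of simulated titles."""
    rng = random.Random(seed)
    wins = Counter(simulate_knockout(strengths, rng) for _ in range(runs))
    return {team: wins[team] / runs for team in strengths}

# Hypothetical strengths, purely illustrative.
STRENGTHS = {"Team A": 4.0, "Team B": 2.0, "Team C": 2.0, "Team D": 1.0}

if __name__ == "__main__":
    for team, p in title_probabilities(STRENGTHS).items():
        print(f"{team}: {p:.1%}")
```

Even the strongest team in this toy bracket wins less than half the simulated tournaments, which is the kind of number that gets lost when the headline is just “X will win”.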

ING went rather simple with their forecasts – simply took players’ transfer values, and summed them up by teams, and concluded that Spain is most likely to win because their squad is the “most valued”. Now, I have two major questions about this approach – firstly, it ignores the “correlation term” (remember the famous England conundrum of the noughties of fitting Gerrard and Lampard into the same eleven?), and assumes a set of strong players is a strong team. Secondly, have they accounted for inflation? And if so, how have they accounted for inflation? Player valuation (about which I have a chapter in my book) has simply gone through the roof in the last year, with Mo Salah at £35 million being considered a “bargain buy”.

Nomura also seems to have taken a similar approach, though they have in some ways accounted for the correlation term by including “team momentum” as a factor!

Anyway, I look forward to the football! That it is live on BBC and ITV means I get to watch the tournament from the comfort of my home (a luxury in England!). Also being in England means all matches are at a sane time, so I can watch more of this World Cup than the last one.


The science of shirt numbers

Yesterday, Michael Cox, author of the Zonal Marking blog and The Mixer, tweeted:

Now, there is some science to how football shirts are numbered. I had touched upon it in a very similar post I had written four years ago. You can also read this account on how players are numbered. And if you’re more curious about formations and their history, I recommend you read Jonathan Wilson’s Inverting the Pyramid.

To put it simply, number 1 is reserved for goalkeepers. Numbers 2 to 6 are for defenders, though some countries use either 4, 5 or 6 for midfielders. 7-11 are usually reserved for attacking midfielders and forwards, with 9 being the “centre forward” and 10 being the “second forward”.

Some of these numbers are so institutionalised that the number is sometimes enough to describe a player’s position and style. This has even led to jargon such as a “False Nine” (a midfielder playing furthest forward) or a “False Ten” (a striker playing in a withdrawn role).

There is less science to the allocation of shirt numbers 12 to 23, since these are not starting positions. One rule of thumb is to allocate these numbers for the backups for the corresponding positions. So 12 is the reserve goalie, 13 is the reserve right back and so on (with 23 for the squad’s third goalkeeper).

So how have teams chosen to number their squads in the FIFA World Cup that starts next week? This picture summarises the distribution of position by number: 


There is no surprise in Number 1, which all teams have allocated to their goalkeeper, and numbers 2 and 3 are mostly allocated to defenders as well (there are some exceptions there, with Iran’s Mehdi Torabi and Denmark’s Michael Krohn Dehli wearing Number 2 even though they are midfielders, and Iceland midfielder Samuel Friojonsson wearing 3).

That different countries use 4, 5 or 6 for midfielders is illustrated in the data, though two forwards (Australian legend Tim Cahill and Croatia’s Ivan Perisic) puzzlingly wear 4 (it’s less puzzling in Cahill’s case since he started as a central midfielder and slowly moved forward).

7 is the right winger’s number, and depending upon that position’s interpretation can either be a midfielder or a forward. 8 is primarily a midfielder, while 9 is (obviously) a striker’s number. Interestingly, five midfielders will wear the Number 9 shirt (the most prominent being Russia’s Alan Dzagoev). 10 and 11 are evenly split between midfielders and forwards, though two defenders (Serbia’s Aleksandr Kolarov and Tunisia’s Dylan Bronn) also wear 11.

Beyond 11, there isn’t that much of a science, but one thing that is clear is that Cox got it wrong – for it isn’t so “textbook” to give 12 to the reserve right back. As we can see from the data, 20 teams have used that number for their reserve goalies!

It’s like England has put their squad numbers into a little bit of a Mixer!

What’s in a shirt number?

There is a traditional way of allotting shirt numbers to football players. “Back to front, right to left”, goes the rule. The goalkeeper is thus number 1. Irrespective of the system used, the right back is number 2, and usually the left forward/winger is number 11.

Now, the way different teams allot numbers depends upon their historical formations, and how their current formations have evolved from those historical formations. The two historical formations are the 2-3-5 (mostly played in Europe) and the W-M (which originated in South America).

You can read Jonathan Wilson’s excellent Inverting the Pyramid to find more about how formations evolved. This post, however, is about shirt numbers in the ongoing world cup.

Now, given the way numbering has evolved in different countries, each number (between 1 and 11) has a traditional set of roles involved. 1 is the goalie everywhere, 2 is right back everywhere. 3 is left back in Europe, but right centre back in South America. 4 is central midfielder in England, but centre back in Spain and South America. 6 is a central defender in England, left back in Brazil/Argentina and a midfielder in Spain.

These are essentially numbering conventions based on how numbering systems have evolved, but are seldom a rule. However, such conventions are so ingrained in the traditional football watcher’s mind that when a player wears a shirt number that is not normally associated with his position, it appears “wrong”.

For example, William Gallas, a centre back (and occasional right back for Chelsea) by trade moved to Arsenal in 2006 and promptly got the number 10 shirt, which is usually reserved for a central attacker/attacking midfielder (in fact, the number now defines the role – it is simply called the “number 10 role”). Last season, West Ham used two successive left backs (Razvan Rat and Pablo Armero), and both were allotted number 8 – traditionally allocated to a central midfielder.

In this post, we will look at the squads of the ongoing world cup and try and understand how many players are wearing “wrong” shirt numbers. In order to do this, we look at the most common roles associated with a particular number, and identify any players that don’t fit this convention.
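The tabulation described above can be sketched roughly as follows. The row layout (number, position, player) and the placeholder player names are assumptions for illustration, not the format of the actual squad data.

```python
from collections import Counter, defaultdict

def roles_by_number(squad_rows):
    """Count how many players in each position wear each shirt number."""
    table = defaultdict(Counter)
    for number, position, _player in squad_rows:
        table[number][position] += 1
    return table

def unconventional(squad_rows):
    """List players whose position is not the most common one for their number.

    If two positions are tied for most common, Counter.most_common keeps
    whichever it saw first, so ties need a closer look by hand.
    """
    table = roles_by_number(squad_rows)
    usual = {n: counts.most_common(1)[0][0] for n, counts in table.items()}
    return [(n, pos, name) for n, pos, name in squad_rows if pos != usual[n]]

# Placeholder rows, not real squad data.
ROWS = [
    (1, "GK", "Player A"),
    (3, "DF", "Player B"),
    (3, "DF", "Player C"),
    (3, "FW", "Player D"),
]

if __name__ == "__main__":
    for number, position, name in unconventional(ROWS):
        print(number, position, name)
```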

Figures 1 and 2 have the summary of the distribution of roles according to shirt number.




As we can see, all number 1s are goalkeepers (perhaps there is a FIFA rule to this effect). Most number 2s and 3s are defenders, but there is also the odd midfielder and forward who wears one of these. Iranian forward Khosro Heydari wears 2, as do Greek midfielder Ioannais Maniatis and Bosnian midfielder Avdija Vrsajevic.

The most unnatural number 3 (in his defence, he’s always worn 3) is Ghanaian striker Asamoah Gyan. Iranian midfielder Ehsan Aji Safi also wears a 3, contrary to convention.

As discussed earlier, midfielders from a few countries wear 4, but there are also two forwards who wear that number – Japanese Keisuke Honda and Australian Tim Cahill. This can be explained by the fact that both of them started off as midfielders, and then turned into forwards, but perhaps wanted to keep their original numbers.

5 is split entirely between defenders and midfielders, who also make up for most of the number 6s. The one exception to this is Russia’s Maksim Kanunnikov, who is a forward. Interestingly, as many as six number 7s (associated with a right winger in both 2-3-5 and W-M systems) are listed as defenders! This includes Colombia’s left back Armero who notoriously wore 8 for West Ham last year. This might possibly be explained by players who started off as wingers and then moved back, but kept their numbers. Two defenders – Costa Rica’s Heiner Mora and Australia’s Bailey Wright wear number 8.

Number 9 is again one of those numbers which is associated with a specific role – a centre forward. In fact, in recent times, there is a variation of this called the “false nine” (there is also a “false ten” now). We would thus expect that all number nines are number nines, but a few midfielders also get that number. Prominent among those is Newcastle’s Cheick Tiote, who wears 9 for Cote D’Ivoire.

10 is split between midfielders and forwards (as expected), but a few defenders wear 11. Croatian captain and right back Darijo Srna wears 11, as also does Greek defender Loukas Vyntra.

Beyond 11, there is no real convention in terms of shirt numbering. The only interesting thing is in the numbers allotted to the reserve goalkeepers (notice that no goalies take any number between 2 and 11). By far, 12 is the most popular number allotted to the reserve goalkeeper, but some teams use 13 as well. Then, 22 and 23 are also pretty popular numbers for goalkeepers.

Finally, we saw that Iran was the culprit in allocating numbers 2 and 3 to non-defenders. Greece, too, came up as a repeat offender in terms of allocating inappropriate numbers. Can we build a “number convention index” and see which countries deviated most from the numbering conventions?

Now, there are degrees in being unconventional, and these need to be accommodated into the analysis. For example, a midfielder wearing 4 (there are 6 of them) is pretty normal, but a forward wearing 3 is simply plain wrong. A forward wearing 8 is not “correct”, but not “wrong” either – this shows that we need more than a simple binary scoring system.

What we will do is to first identify the most common player type for each number, and every such player will get a score of 1. For every other player wearing that number, the score will be the number of such players wearing that number divided by the number of players wearing that number who occupy the most popular position for that number.

I’m assuming the last paragraph didn’t make sense so let me use an example. To use number 2, the most popular position for a number 2 is in defence, so every defender who wears 2 gets 1 point. There are two midfielders who wear 2, compared to 29 defenders who wear 2. Thus, each midfielder who wears 2 gets 2/29 points. One forward wears 2, and he gets 1/29 point.

Taking number 10, the most common position for the number is forward (there are 17 of them), and they all get 1 point. The remaining 15 players who wear 10 are all midfielders, and they get 15/17 points (notice this is not so much less than 1).

This way, each member of each squad gets allotted points based on how “normal” his shirt number is given his position. Summing up the points across players of a team, we get a team score on how “natural” the shirt numbers are. The maximum score a team can get is 23 (each player wearing a number appropriate for his position).
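The scoring rule can be sketched in a few lines. The (team, number, position) row layout and the function names are assumptions for illustration; the optional `numbers` argument is just a convenience for scoring a subset of shirt numbers.

```python
from collections import Counter, defaultdict

def player_score(position_counts, position):
    """Players in this position wearing the number, divided by the count
    for the most common position; the most common position scores 1."""
    return position_counts[position] / max(position_counts.values())

def team_scores(players, numbers=range(1, 24)):
    """Sum player scores by team; players is a list of (team, number, position)."""
    counts = defaultdict(Counter)
    for _team, number, position in players:
        counts[number][position] += 1
    totals = defaultdict(float)
    for team, number, position in players:
        if number in numbers:
            totals[team] += player_score(counts[number], position)
    return dict(totals)

if __name__ == "__main__":
    # The two worked examples from the text.
    number_2 = {"DF": 29, "MF": 2, "FW": 1}
    number_10 = {"FW": 17, "MF": 15}
    print(player_score(number_2, "MF"))   # 2/29
    print(player_score(number_2, "FW"))   # 1/29
    print(player_score(number_10, "MF"))  # 15/17
```

Note that the counts are taken across all squads in the tournament, so a team’s score depends on what everyone else has done with the same number.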

Table 3 here has the team-wise information on correctness of shirt numbers. The team with the worst allocated shirt numbers happens to be Nigeria with 16.13. At the other end, the team that has allocated numbers most appropriately is Ecuador, with 21.

Country  Score
Nigeria         16.13
Costa Rica         16.85
Greece         17.22
Australia         17.45
Iran         17.69
Ivory Coast         17.74
Colombia         18.19
Cameroon         18.19
Argentina         18.29
USA         18.41
Italy         18.63
Portugal         18.66
Algeria         18.68
Honduras         18.69
France         19.02
Ghana         19.06
Croatia         19.08
Netherlands         19.20
Mexico         19.23
Brazil         19.28
Chile         19.37
Japan         19.52
South Korea         19.69
England         19.77
Russia         20.03
Switzerland         20.04
Uruguay         20.25
Spain         20.27
Bosnia & Herzegovina         20.37
Belgium         20.83
Germany         20.86
Ecuador         21.01


This, however, may not tell the complete story. As we saw earlier, conventions regarding numbers between 12 and 23 are not as strict, and thus these numbers can get allocated in a more random fashion compared to 1-11. There are absolutely no taboos related to numbers 12-23, and thus, misallocating them is less of a crime than misallocating 1-11.

Hence, we will look at the numbers 1 to 11, and see how teams have performed. Table 4 has this information:

Country  Score
Australia           7.45
USA           8.04
Iran           8.18
Greece           8.59
Nigeria           8.63
Ivory Coast           8.71
Costa Rica           8.84
Ghana           8.95
Croatia           8.96
Japan           9.06
Colombia           9.27
Brazil           9.29
Spain           9.66
Bosnia & Herzegovina           9.67
Uruguay           9.70
Honduras           9.71
Portugal           9.78
Italy           9.81
South Korea           9.83
Cameroon           9.83
Chile           9.87
Russia           9.94
England           9.97
Argentina         10.03
Switzerland         10.18
Netherlands         10.24
Algeria         10.30
Ecuador         10.34
Mexico         10.35
France         10.53
Belgium         10.88
Germany         11.00


Germany has a “perfect” first eleven, in terms of number allocation. Belgium comes close. At the other end of the scale, we have Australia, which seems to have the most misallocated 1-11 shirt numbers. Iran and Greece, which we anecdotally saw as having high misallocations are at three and four, with the United States at 2.

Note: The data is taken from the Guardian Data Blog. Now, this analysis should be taken with some salt since in the modern game, the division of players into “defender”, “midfielder” and “forward” is not straightforward. Where would you put a “classic number ten”? What about a wing back? And so forth.

Identifying Groups of Death

The Guardian has an interesting set of graphics trying to identify the “Group of Death” at the forthcoming (2014) football World Cup. They have basically ordered groups and teams on three counts – something called “strength of schedule” (how it is calculated is not explained), average strength of each group (mean rating points) and the strength of each match (sum total of rating points of the teams playing). They don’t actually go on to identify which the groups of death are.

Another piece in the same paper gives a history of the concept of the Group of Death, and tries to explain why some groups can be classified so while others cannot. So in this post we will focus on precisely that – once a draw has been made, how do we identify groups of death? Without loss of generality, let us restrict our analysis to groups of four teams from which two qualify for the next round following a round robin (the format the World Cup uses). We will also restrict ourselves to analyzing the group stage and ignore chances of “death” in the knockout stages.

A “group of death” traditionally refers to a group where at least one “favourite” team gets knocked out. Assuming that a team with higher odds of winning the tournament is likely to beat one with lower odds, a group of death is necessarily one that contains at least three teams that are “favourites” to win the tournament.

From this, one way to measure groups of death is to order teams in decreasing order of odds of winning based on a reputed bookie’s odds, and then see how closely the top three teams of a group are clustered. The closer three teams are to each other, the more closely contested the group is. We can use a distance metric to measure this.

Another simpler method is to see the odds of the third team in a group winning the world cup. The groups where the third best team has the shortest odds of winning are the groups of death! Again this is a relative metric since if each group has two “strong teams” and two “weak teams” there is effectively no group of death (hence the earlier metric trumps this one).

Another way to identify how deadly the groups are is to use bilateral odds for each match, and to identify the odds that the two “seeded teams” in a group do not both qualify. For example, Group B has Spain, Netherlands, Australia and Chile, with the first two being the “seeded teams” (given their ranking). Now, we can calculate the probability that at least one of Spain and Netherlands doesn’t qualify. That gives the “death rating” for this group. The group for which this “death rating” is highest is the group of death.
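A rough sketch of how such a “death rating” could be computed by enumerating every possible set of results in a four-team round robin. Several things here are assumptions for illustration only: the strength numbers are made up, match probabilities come from a toy strength-ratio model with a fixed draw probability rather than from real bilateral odds, and teams level on points are separated by a coin toss instead of goal difference.

```python
from itertools import permutations, product

def match_probs(s_a, s_b, draw=0.25):
    """Toy model: (P(a wins), P(draw), P(b wins)) from relative strengths."""
    share = s_a / (s_a + s_b)
    return ((1 - draw) * share, draw, (1 - draw) * (1 - share))

def death_rating(strengths, seeded):
    """Probability that the two seeded teams do not both finish in the top two."""
    teams = list(strengths)
    fixtures = [(a, b) for i, a in enumerate(teams) for b in teams[i + 1:]]
    tie_orders = list(permutations(teams))  # every possible coin-toss ordering
    both_qualify = 0.0
    for results in product(range(3), repeat=len(fixtures)):
        prob = 1.0
        points = dict.fromkeys(teams, 0)
        for (a, b), result in zip(fixtures, results):
            prob *= match_probs(strengths[a], strengths[b])[result]
            if result == 0:
                points[a] += 3
            elif result == 1:
                points[a] += 1
                points[b] += 1
            else:
                points[b] += 3
        # Average over tie-break orders so level teams are split at random.
        hits = 0
        for order in tie_orders:
            ranked = sorted(teams, key=lambda t: (-points[t], order.index(t)))
            if set(ranked[:2]) == set(seeded):
                hits += 1
        both_qualify += prob * hits / len(tie_orders)
    return 1 - both_qualify

# Made-up strengths for Group B, purely illustrative.
GROUP_B = {"Spain": 8.0, "Netherlands": 5.0, "Chile": 4.0, "Australia": 1.0}

if __name__ == "__main__":
    print(death_rating(GROUP_B, ("Spain", "Netherlands")))
```

As a sanity check, a group of four equally strong teams gives a rating of 5/6, since each of the six possible pairs is equally likely to go through.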

As you can see, there are several ways for identifying the group of death. Unfortunately, none of the analysis that the Guardian has put out contributes to this. Let us now look at a couple of methods for ourselves. For the purpose of analysis I’m using the easiest available odds, which are from the Bleacher Report. Ideally, for this analysis we should be using odds before the draw was made – since the draw itself would have ended up adjusting odds. Nevertheless, since this is for illustrative (rather than predictive) purposes only, we’ll stick to the current odds.

Let us start with the easiest method, which is the odds of victory of the third best team in the group. Based on the Bleacher odds, the third best teams in each group are likely to be:

A Mexico 150/1
B Chile 33/1
C Japan 150/1
D England 28/1
E Ecuador 150/1
F Nigeria 250/1
G United States 150/1
H South Korea 150/1

Two teams stand out – Chile at 33/1 and England at 28/1. Based on this metric, the group of death is Group D (Italy, England, Uruguay, Costa Rica). The Guardian might say that Australia, Ghana or the United States might have the toughest draw, but the odds of each of them winning are so low that it doesn’t matter that they have tough draws!
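The ranking above can be reproduced with a few lines, using the third-best odds listed earlier. Converting fractional odds of a/b into an implied probability of b/(a+b) ignores the bookmaker’s margin, which is a simplifying assumption.

```python
from fractions import Fraction

# Third-best team in each group and its odds, as listed above.
THIRD_BEST = {
    "A": ("Mexico", "150/1"),
    "B": ("Chile", "33/1"),
    "C": ("Japan", "150/1"),
    "D": ("England", "28/1"),
    "E": ("Ecuador", "150/1"),
    "F": ("Nigeria", "250/1"),
    "G": ("United States", "150/1"),
    "H": ("South Korea", "150/1"),
}

def implied_probability(odds):
    """Turn fractional odds 'a/b' into the implied probability b / (a + b)."""
    against, stake = (int(part) for part in odds.split("/"))
    return Fraction(stake, against + stake)

def rank_groups(third_best):
    """Order groups from deadliest (likeliest third team) to least deadly."""
    return sorted(third_best,
                  key=lambda g: implied_probability(third_best[g][1]),
                  reverse=True)

if __name__ == "__main__":
    for group in rank_groups(THIRD_BEST):
        team, odds = THIRD_BEST[group]
        print(group, team, odds, float(implied_probability(odds)))
```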

Let us now use another metric – the difference between the odds of the second and third placed teams in each group. One metric of the group of death might be where this difference is the minimum (this metric has the problem of classifying groups with one clear winner as groups of death, while they technically are not).

And this metric identifies Group F (Bosnia, Nigeria) and Group A (Croatia, Mexico) as groups of death. You might notice that these are Argentina and Brazil’s groups respectively and those two teams are expected to sail through, so this is not a good metric.

Next, let us involve the top three teams of each group (to prevent the above anomaly) and look at the sum of the absolute difference in odds. For example, if the odds of the top three teams in a group are o1, o2, o3, we will measure each group by (|o1-o2| + |o2-o3| + |o3-o1|). The smaller this sum is, the more likely a group is a “group of death”.

The results from this metric are below:

A 42% Brazil, Croatia, Mexico
B 16% Spain, Netherlands, Chile
C 7% Colombia, Japan, Greece
D 0.7% Uruguay, Italy, England
E 7% Switzerland, France, Ecuador
F 27% Argentina, Bosnia, Nigeria
G 23% Germany, Portugal, United States
H 10% Belgium, Russia, South Korea.

From this metric, it is absolutely clear which the most competitive group is – it is group D, with Uruguay, Italy and England. Based on this metric, it is unambiguous that Group D is the group of death. Groups C and E come next according to this measure, followed by Group H.
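For reference, a minimal sketch of the sum-of-absolute-differences metric. The input values below are made up, and treating the odds as probabilities is an assumption (suggested by the percentages in the results, but not stated above). For three teams the sum collapses to twice the gap between the best and the third best, so the second team’s odds don’t actually affect the result.

```python
from itertools import combinations

def spread(odds):
    """Sum of absolute pairwise differences: |o1-o2| + |o2-o3| + |o3-o1|."""
    return sum(abs(a - b) for a, b in combinations(odds, 2))

if __name__ == "__main__":
    # Hypothetical win probabilities for the top three teams of one group.
    top_three = [0.10, 0.06, 0.03]
    print(spread(top_three))
    # For three values this is the same as 2 * (max - min).
    print(2 * (max(top_three) - min(top_three)))
```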

Sachin’s 100th

In the end it was quite appropriate. That the needlessly hyped “false statistic” of Sachin’s 100 100s came about in a match against a supposed minnow, in an inconsequential tournament, which didn’t even help India win the game. The hype surrounding this statistic had become unbearable, both for normal cricket fans and also for Sachin, perhaps. And that could be seen in his batting over the last one year, in England and in Australia. There was a distinct feeling that every time he just kept playing for his century, and not for the team cause, and the only upshot of his “100th 100” is that the monkey is finally off his back and hopefully Sachin can go back to playing normal cricket.

Unfortunately, there are a couple of other milestones round the corner. He now has 49 ODI 100s, so now people will hype up his 50th. And as someone pointed out on facebook yesterday, he has 199 international wickets! Hopefully that means he starts turning his arm over once again, with his lethal spinning leg-breaks and long hops.

The thing with Sachin is that he has always seemed to be statistically minded (irrespective of what he says in his interviews). The mind goes back to Cuttack during World Cup 1996, when he played out two maiden overs against Asif Karim while trying to get to his 100 (against Kenya). Even in recent times, including in 2007 when he got out in the 90s a large number of times, it is noticeable how he suddenly slows down the innings once he gets into the 90s. He gets nervous, starts thinking only about the score, and not about batting normally.

In that sense, it is appropriate that this meaningless statistic of a hundredth hundred came about in a game that India lost, to a supposed minnow. It was a “batting pitch”. As Raina and Dhoni showed in the latter stages of the innings, shotmaking wasn’t particularly tough. And yet, what did Sachin do? Plod at a strike rate of 75 for most of the innings, including in the crucial batting powerplay just so that he could get to his 100. I don’t fault his batting for the first 35 overs. He did what was required to set up a solid foundation, in Kohli’s company. But in the batting powerplay, instead of going for it, the only thing on his mind was the century. Quite unfortunate. And appropriate, as I’ve said a number of times earlier.

Again, I want to emphasize that I’m NOT an anti-Sachintard. I’ve quite enjoyed his batting in the past, and there is no question that he is one of the all-time great cricketers. I’m only against meaningless stat-tardness. And it was this retardation about a meaningless stat that prevented Sachin from giving his best for the last one year.


As the World Cup starts I realize I’m liking ODI cricket more now than I used to in the last couple of years. The key thing for me, I think, is the second coming of classical batsmen to One Day Cricket.

The problem with ODIs in the mid 2000s was that it had become a slambang game. Too many slambang players, with dodgy techniques were dominating the scenes. Boundaries got pulled in and pitches became flat (these two are still a problem I must say) and it just degenerated into slugfests. It was, to use a famous phrase, just not cricket.

In a way, I think the coming of T20 has actually helped make the ODIs a more classical game. What it has done is to make the slambang guys specialize in the even more slambang version (it has helped that there is a lot of money to be made by being good at T20).

Suddenly the slambang guys have figured that they’ve lost the skill of building an innings, which is something crucial for the one day game. If your team has to score 300, it is very likely that at least one batsman has to get something like a 100, and scoring 100s is out of the skill-set of the slambangers.

So you see the likes of “holding players” like Hashim Amla and Jonathan Trott coming good at ODIs, while in the mid-to-late noughties they would’ve never been selected for what was then the “shorter form of the game”.

Also, the quality of cricket in some recent ODI series (RSA-Ind, RSA-Pak, etc.) has been encouraging, and if not for the idiotic format I would’ve been really looking forward to the World Cup.

Srinath and Mithun

As soon as Abhimanyu Mithun took a hat-trick on his Ranji debut, comparisons started with that other Karnataka fast bowler who did the same – Javagal Srinath. However, given the way things are with his career now – dropped after a not-so-bad debut series in Sri Lanka, and following that up with an unspectacular Ranji season – it’s unlikely he will have the same kind of impact.

Ability apart (Mithun so far hasn’t shown signs of bowling anywhere as fast as Srinath did), what might make a major difference in their respective careers is in terms of handling by the selectors and the team management.

One has to really give it to Azhar, Abbas Ali Baig (then the team manager) and whoever was in the selection committee back then for the way they managed Srinath’s early career.

Just take a look at his profile on statsguru: he didn’t take a 4-for until his ninth Test match. In his preceding eight Test matches, he had a bowling average of 46 (and he was dropped once – because the pitch at St George’s Park in Port Elizabeth was supposed to take spin). And in the meantime, he had played a World Cup – having taken part in all matches that India played.

Of course he was dropped immediately after his 4-for to make way for a 3-man spin attack. But he was always kept in the squad, and Azhar made it clear to him that he would always play whenever India wanted to play 3 quick men (the first time ever that he was dropped for another fast bowler was perhaps in the finals of Singer Cup in Sri Lanka in 1994 when he made way for Venky Prasad).

Considering how much India chopped and changed with the support attack to Kapil and Prabhakar in the late 80’s it is indeed surprising the way they gave Srinath a long rope. And it paid off magnificently well, in the way he carried India’s bowling attack in the mid to late 90s.

Maybe it was because of his pace, and no one else was close to being as quick.

Compare that to the handling of Mithun. After playing a full series in Sri Lanka, on flat pitches and not bowling too badly, Mithun finds himself completely out of the picture. Not even the fifth best bowler, it seems. Given the way he has been handled, I won’t be surprised if he fades away.

Again, he is nowhere as quick as Srinath though he is reputed to have once been. My cousin Sandeep who knows the insides of Karnataka cricket tells me that Mithun had a back injury even before he made his first class debut, which perhaps explains the drop in pace.

But it is perhaps the way he has been handled by the national selectors that would be responsible if his career were to fizzle (the same applies to other “bad drops”, also, though I must say that Murali Kartik has done quite well despite having been handled so shabbily).

PS: I expect a number of you to comment that he’s not that great a bowler. The simple reasons I’ve used his case rather than anyone else’s are that he plays for the Ranji team I support, and that he is fresh in my mind considering I’ve been watching him in the Ranji QF against MP today.