Chasing Dhoni

Former India captain Mahendra Singh Dhoni has a mixed record when it comes to chasing in limited overs games (ODIs and T20s). He initially built up his reputation as an expert chaser, who knew exactly how to pace an innings and accelerate at the right moment to deliver victory.

Of late, though, his chasing has been going wrong, the latest example being Chennai Super Kings’ loss at Kings XI Punjab over the weekend. Dhoni no doubt played excellently – 79 off 44 is a brilliant innings in most contexts. Where he possibly fell short was in the way he paced the innings.

And the algorithm I’ve built to represent (and potentially evaluate) a cricket match seems to have done a remarkable job in identifying this problem in the KXIP-CSK game. Now, apart from displaying how the game “flowed” from start to finish, the algorithm is also designed to pick out key moments or periods in the game.

One kind of “key period” that the algorithm tries to pick is a batsman’s innings – periods of play where a batsman made a significant contribution (either positive or negative) to his team’s chances of winning. And notice how nicely it has identified two distinct periods in Dhoni’s batting:

The first period is one where Dhoni settled down, and batted rather slowly – he hit only 21 runs in 22 balls in that period, which is incredibly slow for a 10 runs per over game. Notice how this period of Dhoni’s batting coincides with a period when the game decisively swung KXIP’s way.

And then Dhoni went for it, hitting 36 runs in 11 balls (which is great going even for a 10-runs-per-over game), including 19 off the penultimate over bowled by Andrew Tye. While this brought CSK back into the game (to right where the game stood prior to Dhoni’s slow period of batting), it was a little too late as KXIP managed to hold on.

Now I understand I’m making an argument using one data point here, but this problem with Dhoni, where he first slows down and then goes for it with only a few overs to go, has been discussed widely. What’s interesting is how neatly my algorithm has picked out these periods!

A banker’s apology

Whenever there is a massive stock market crash, like the one in 1987, or the crisis in 2008, it is common for investment banking quants to talk about how it was a “1 in zillion years” event. This is on account of their models that typically assume that stock prices are lognormal, and that stock price movement is Markovian (today’s movement is uncorrelated with tomorrow’s).

In fact, a cursory look at recent data shows that what models show to be a one in zillion years event actually happens every few years, or decades. In other words, while quant models do pretty well in the average case, they have thin “tails” – they underestimate the likelihood of extreme events, which leads to risk building up in the system.

When I decided to end my (brief) career as an investment banking quant in 2011, I wanted to take the methods that I’d learnt into other industries. While “data science” might have become a thing in the intervening years, there is still a lot for conventional industry to learn from banking in terms of using maths for management decision-making. And this makes me believe I’m still in business.

And like my former colleagues in investment banking quant, I’m not immune to the fat tail problem either – replicating solutions from one domain into another can replicate the problems as well.

For a while now I’ve been building what I think is a fairly innovative way to represent a cricket match. Basically you look at how the balance of play shifts as the game goes along. So the representation is a line graph that shows where the balance of play was at different points of time in the game.

This way, you have a visualisation that at one shot tells you how the game “flowed”. Consider, for example, last night’s game between Mumbai Indians and Chennai Super Kings. This is what the game looks like in my representation.

What this shows is that Mumbai Indians got a small advantage midway through the innings (after a short blast by Ishan Kishan), which they held through their innings. The game was steady for about 5 overs of the CSK chase, when some tight overs created pressure that resulted in Suresh Raina getting out.

Soon, Ambati Rayudu and MS Dhoni followed him to the pavilion, and MI were in control, with CSK losing 6 wickets in the course of 10 overs. When they lost Mark Wood in the 17th over, Mumbai Indians were almost sure winners – my system reckoned that 48 to win in 21 balls was near-impossible.

And then Bravo got into the act, putting on 39 in 10 balls with Imran Tahir watching at the other end (including taking 20 off a Mitchell McClenaghan over, and 20 again off a Jasprit Bumrah over at the end of which Bravo got out). And then a one-legged Jadhav came, hobbled for 3 balls and then finished off the game.

Now, while the shape of the curve in the above graph is representative of what happened in the game, I think it went too close to the axes. 48 off 21 with 2 wickets in hand is not easy, but it’s not a 1% probability event (as my graph depicts).

And looking into my model, I realise I’ve made the familiar banker’s mistake – of assuming independence and the Markovian property. I calculate the probability of a team winning using a method called “backward induction” (that I’d learnt during my time as an investment banking quant). It’s the same method used by WASP, the system for evaluating odds invented by a few Kiwi scientists, and as I’d pointed out in the past, WASP has the thin tails problem as well.
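To make the independence assumption concrete, here is a minimal sketch of backward induction for a chase. It is not my actual model: the per-ball outcome probabilities in BALL_OUTCOMES are made-up illustrative numbers, and the function name win_probability is hypothetical. Every ball is treated as an independent draw from the same distribution, which is exactly the assumption that produces thin tails.

```python
from functools import lru_cache

# Hypothetical per-ball outcome probabilities (illustrative, not fitted to data).
# Keys are runs scored off the ball; "W" means a wicket falls. They sum to 1.
BALL_OUTCOMES = {0: 0.35, 1: 0.35, 2: 0.07, 4: 0.11, 6: 0.07, "W": 0.05}


@lru_cache(maxsize=None)
def win_probability(runs_needed: int, balls_left: int, wickets_left: int) -> float:
    """Chasing side's win probability, assuming every ball is independent."""
    if runs_needed <= 0:
        return 1.0  # target already reached
    if balls_left == 0 or wickets_left == 0:
        return 0.0  # out of balls or out of batsmen
    total = 0.0
    for outcome, p in BALL_OUTCOMES.items():
        if outcome == "W":
            total += p * win_probability(runs_needed, balls_left - 1, wickets_left - 1)
        else:
            total += p * win_probability(runs_needed - outcome, balls_left - 1, wickets_left)
    return total


# The situation from the MI-CSK game: 48 to win in 21 balls, 2 wickets in hand.
print(win_probability(48, 21, 2))
```

The recursion works from the end-of-game states back to the current state, with memoisation standing in for an explicit table. Since no ball "knows" that the previous one went for six, a streak like Bravo's gets a very small probability under this setup.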

As Seamus Hogan, one of the inventors of WASP, had pointed out in a comment on that post, one way of solving this thin tails issue is to control for the pitch or regime, and I’ve incorporated that as well (using a Bayesian system to “learn” the nature of the pitch as the game goes on). Yet, I see that I still struggle with fat tails.
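As a rough illustration of what “learning” the pitch could look like, here is a toy Bayesian update over two regimes. This is only a sketch under assumptions: the regime names, the outcome categories and all the probabilities in REGIMES are invented for the example, and my actual system is more involved.

```python
# Hypothetical regimes with made-up per-ball outcome probabilities (each sums to 1).
REGIMES = {
    "batting": {"dot": 0.30, "runs": 0.45, "boundary": 0.22, "wicket": 0.03},
    "bowling": {"dot": 0.45, "runs": 0.38, "boundary": 0.10, "wicket": 0.07},
}


def update(prior: dict, outcome: str) -> dict:
    """Return the posterior over regimes after observing one ball."""
    unnormalised = {regime: prior[regime] * REGIMES[regime][outcome] for regime in prior}
    total = sum(unnormalised.values())
    return {regime: value / total for regime, value in unnormalised.items()}


# Start undecided, then watch a few balls.
belief = {"batting": 0.5, "bowling": 0.5}
for ball in ["boundary", "runs", "dot", "boundary"]:
    belief = update(belief, ball)
print(belief)
```

Each boundary nudges the belief towards the batting regime, and each wicket or dot ball nudges it back. The posterior can then be used to weight the win probabilities computed under each regime.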

I seriously need to find a way to build serial correlation into my models!

That said, I must say I’m fairly kicked about the system I’ve built. Do let me know what you think of this!

English Premier League: Goal Difference to points correlation

So I was just looking down the English Premier League Table for the season, and I found that as I went down the list, the goal difference went lower. There’s nothing counterintuitive in this, but the degree of correlation seemed eerie.

So I downloaded the data and plotted a scatter-plot. And what do you have? A near-perfect regression. I even ran the regression and found a 96% R Square.
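For anyone who wants to repeat this, a regression of this kind can be sketched in a few lines. The numbers below are placeholders and not the actual league table, and the function name fit_line is my own.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x, plus R-squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # squared correlation, valid for one predictor
    return slope, intercept, r_squared


# Placeholder data: goal difference and points for five made-up teams.
goal_difference = [50, 20, 5, -10, -30]
points = [90, 70, 55, 42, 30]
slope, intercept, r_squared = fit_line(goal_difference, points)
print(slope, intercept, r_squared)
```

With a single explanatory variable, R Square is simply the square of the correlation between goal difference and points, which is why a tight scatter plot and a high R Square go together.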

In other words, this EPL season has simply been all about scoring lots of goals and not letting in too many goals. It’s almost like the distribution of the goals itself doesn’t matter – apart from the relegation battle, that is!

PS: Look at the extent of Manchester City’s lead at the top. And what a scrap the relegation is!

Mike Hesson and cricket statistics

While a lot is made of the use of statistics in cricket, my broad view based on presentation of statistics in the media and the odd player/coach interview is that cricket hasn’t really learnt how to use statistics as it should. A lot of so-called insights are based on small samples, and coaches such as Peter Moores have been pilloried for their excess focus on data.

In this context, I found this interview with New Zealand coach Mike Hesson in ESPNCricinfo rather interesting. From my reading of the interview, he seems to “get” data and how to use it, and helps explain the general over-performance to expectations of the New Zealand cricket team in the last few years.

Some snippets:

You’re trying to look at trends rather than chuck a whole heap of numbers at players.

For example, if you look at someone like Shikhar Dhawan, against offspin, he’s struggled. But you’ve only really got a nine or ten-ball sample – so you’ve got to make a decision on whether it’s too small to be a pattern

Also, players take a little while to develop. You’re trying to select the player for what they are now, rather than what their stats suggest over a two or three-year period.

And there are times when you have to revise your score downwards. In our first World T20 match, in Nagpur, we knew it would slow up,


Go ahead and read the whole thing.

What did Brendan in? Priors? The schedule? Or the cups?

So Brendan Rodgers has been sacked as Liverpool manager, after what seems like an indifferent start to the season. The club is in tenth position with 12 points after 8 games, with commentators noting that “at the same stage last season” the club had 13 points from 8 games.

Yet, the notion of “same stage last season” is wrong, as I’d explained in this post I’d written two years back (during Liverpool’s last title chase), since the fixture list changes year on year. As I’ve explained in that post, a better way to compare a club’s performance is to compare its performance this season to corresponding fixtures from last season.

Looking at this season from such a lens (and ignoring games against promoted teams Bournemouth and Norwich), this is what Liverpool’s season so far looks like:

Fixture                   This season   Last season   Difference
Stoke away                Win           Loss          +3
Arsenal away              Draw          Loss          +1
West Ham home             Loss          Win           -3
Manchester United away    Loss          Loss           0
Aston Villa home          Win           Loss          +3
Everton away              Draw          Draw           0

In other words, compared to similar fixtures last season, Liverpool is on a +4 (winning two games and drawing one among last season’s losses, and losing one of last season’s wins). In fact, if we look at the fixture schedule, apart from the games against promoted sides (which Liverpool didn’t do wonderfully in, scraping through with an offside goal against Bournemouth and drawing with Norwich), Liverpool have had a pretty tough start to the season in terms of fixtures.
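The arithmetic behind the +4 is easy to check. A small sketch, using the results from the table above and the usual 3-1-0 points system (the variable and function names are my own):

```python
POINTS = {"Win": 3, "Draw": 1, "Loss": 0}

# (fixture, result this season, result in the corresponding fixture last season)
fixtures = [
    ("Stoke away", "Win", "Loss"),
    ("Arsenal away", "Draw", "Loss"),
    ("West Ham home", "Loss", "Win"),
    ("Manchester United away", "Loss", "Loss"),
    ("Aston Villa home", "Win", "Loss"),
    ("Everton away", "Draw", "Draw"),
]


def points_swing(this_season: str, last_season: str) -> int:
    """Points gained relative to the corresponding fixture last season."""
    return POINTS[this_season] - POINTS[last_season]


swings = [points_swing(now, before) for _, now, before in fixtures]
total_swing = sum(swings)
print(swings, total_swing)  # [3, 1, -3, 0, 3, 0] 4
```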

So the question is what led to Brendan Rodgers’ dismissal last night? Surely it can’t be the draw at Everton, for that has become a “standard result” of late? Maybe the fact that Liverpool didn’t win allowed the management to make the announcement last evening, but surely the decision had been made earlier?

The first possibility is that the priors had been stacked against Rodgers. Considering the indifferent performance last season in both the league (except for one brilliant spell) and the cups, and the sacking of Rodgers’ assistants, it’s likely that the benefit of the doubt before the season began was against Rodgers, and only a spectacular performance could have turned it around.

The other possibility is indifferent performances in the cups, with 1-1 home draws against FC Sion and Carlisle United being the absolute low points, in fixtures that one would have expected Liverpool to win easily (albeit with weakened sides). While Liverpool is yet to exit any cup, indifferent performances so far meant that there hasn’t been much improvement in the squad since last season.

Leaving aside a “bad prior” at the beginning of the season and cup performances (no pun intended), there’s no other reason to sack Rodgers. As my analysis above shows, his performance in the league hasn’t been particularly bad in terms of results, with only the defeat to West Ham and possibly the draw to Norwich being bad. If Fenway Sports Group (the owners of Liverpool FC) have indeed sacked Rodgers on his league performance, it simply means that they don’t fully get the “Moneyball” philosophy that they supposedly follow, and could do with some quant consulting.

And if they’re reading this, they should know who to approach for such consulting services!

Valuing loan deals for football players

Initial reports yesterday regarding Radamel Falcao’s move to Manchester United mentioned a valuation of GBP 6 million for the one year loan, i.e. Manchester United had paid Falcao’s parent club AS Monaco GBP 6 million so that they could borrow Falcao for a year. This evidently didn’t make sense since earlier reports suggested that Falcao had been priced at GBP 55 million for an outright transfer, and has four years remaining on his Monaco contract.

In this morning’s reports, however, the value of the loan deal has been corrected to GBP 16 million, which makes more sense in light of his remaining period of contract, age and outright valuation.

So how do you value a loan deal for a player? To answer that, first of all, how do you value a player? The “value” of a player is essentially the amount of money that the player’s parent club is willing to accept in exchange for foregoing his use for the rest of his contract. Hence, for example, in Falcao’s case, GBP 55M  is the amount that Monaco was willing to accept for foregoing the remaining four years they have him on contract.

Based on this, you might guess that transfer fees are (among other things) a function of the number of years that a player has remaining on his contract with the club – ceteris paribus, the longer the period of contract, the greater is the transfer fee demanded (this is intuitive. You want more compensation for foregoing something for a longer time period than for a shorter time period).

From this point of view, let us now evaluate what it might take to take Falcao on loan for one year. Conceptually it is straightforward. Let us assume that the value Monaco expects to get from having Falcao on their books for a further four years is a small amount less than their asking price of GBP 55M – given they were willing to forego their full rights for that amount, their valuation can be any number below that; we’ll assume it was just below that. Now, all we need to do is to determine how much of this GBP 55M in value will be generated in the first year, how much in the second year and so on. Whatever is the value for the first year is the amount that Monaco will demand for a loan.

Now, loans can be of different kinds. Clubs sometimes lend out their young and promising players so that they can get first team football in a different club – something the parent club would not be able to provide. In such loans, clubs expect the players to come back as better players (Daniel Sturridge’s loan from Chelsea to Bolton is one such example) and thus with a higher valuation. Given this expectation, loan fees are usually zero (or even negative – where the parent club continues to bear part of the loanee’s wages).

Another kind of loan is for a player who is on the books but not particularly wanted for the season. It could happen that the player’s wages are more than what the club hopes to get in terms of his contribution on the field (implying a negative valuation for the player). In such cases, it is possible for clubs to loan out the player while still covering part of the player’s salary. In that sense, the loan fee paid by the target club is actually negative (since they are in a sense being paid by the parent club to loan the player out). An example of this kind was Andy Carroll’s loan from Liverpool to West Ham United in the 2012-13 season.

Falcao is currently in the prime of his career (aged 29) and heavily injury prone. Given his age and injury record, he is likely to be a fast depreciating asset. By the time he runs out his contract at Monaco (when he will be 33), he is likely to be not worth anything at all. This means that the lion’s share of the value Monaco can derive out of him would be what they would derive in the next one year. This is the primary reason that Monaco have demanded about 30% of the four-year fee (GBP 16M out of GBP 55M) for one year of loan.
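Here is one way to put numbers on the year-by-year split described above. It is a toy model under loud assumptions: the total four-year value is taken to be the full GBP 55M, each year’s value is a fixed fraction (retention, a parameter I am inventing here) of the previous year’s, and discounting and wages are ignored.

```python
def first_year_share(years: int, retention: float) -> float:
    """Share of total value generated in year one, when each year's value
    is `retention` times the previous year's value."""
    weights = [retention ** k for k in range(years)]
    return weights[0] / sum(weights)


TOTAL_VALUE = 55.0  # GBP million, assumed equal to the asking price
for retention in (1.0, 0.9, 0.7):
    loan_fee = TOTAL_VALUE * first_year_share(4, retention)
    print(f"retention={retention}: one-year loan worth about GBP {loan_fee:.1f}M")
```

With no depreciation (retention of 1.0) the first year is worth a quarter of the total, or GBP 13.75M. In this toy model a retention of about 0.9 puts the first year close to GBP 16M; steeper depreciation pushes the one-year value higher still.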

Loaning a player also involves some option valuation – based on his performance on loan his valuation at the end of the loan period can either increase or decrease. At the time of loaning out this is a random variable and we can only work on expectations. The thing with Falcao is that given the stage of his career the probability of him being much improved after a year is small. On the other hand, his brittleness means the probability of him being a lesser player is much larger. This ends up depressing the expected valuation at the end of the loan period and thus pushes up the loan fee. Thinking about it, this should have pushed up Falcao’s loan fee above GBP 16M, but another factor – that he has just returned from injury and may not be at peak impact for a couple of months – has depressed the fee.

Speaking of option valuation, it is possibly the primary reason why young loan signings to lesser clubs come cheap – the possibility of regular first team football increases significantly the expected valuation of the player at the end of the loan period, and this coupled with the fact that the player is not yet proven (which implies a low “base sale price”) drives the loan valuation close to zero.

Loaning is thus a fairly complex process, but players’ valuations can be done in rather economic terms – based on expected contribution in that time period and option valuation. Loaning can also get bizarre at times – Fernando Torres’s move to Milan, for example, has been classified by Chelsea as a “two year loan”, which is funny given that he has two years remaining on his Chelsea contract. It is likely that the deal has been classified as a loan for accounting purposes so that Chelsea do not write off the GBP 50M they paid for Torres’s rights in January 2011 too soon.

Why Brazil is undervalued by punters

When India exited the 2007 Cricket World Cup, broadcasters, advertisers and sponsors faced huge losses. They had made the calculations for the tournament based on the assumption that India would qualify for the second group stage, at least, and when India failed to do so, it possibly led to massive losses for these parties.

Back then I had written this blog post where I had explained that one way they could have hedged their exposure to the World Cup would have been by betting against India’s performance. Placing a bet that India would not get out of their World Cup group would have, I had argued, helped mitigate the potential losses coming out of India’s early exit. It is not known if any of them actually hedged their World Cup bets in the betting market.

Looking at the odds in the ongoing Football World Cup, though, it seems like bets are being hedged. The equivalent of India in this World Cup is Brazil, the home team. While the world football market is reasonably diversified with a large number of teams having a reasonable fan following, the overall financial success of the World Cup depends on Brazil’s performance. An early exit by Brazil (as almost happened on Saturday) can lead to significant financial losses for investors in the tournament, and thus they would like to hedge these bets.

The World Cup simulator is a very interesting website which simulates the remaining games of the World Cup based on a chosen set of parameters (you can choose a linear combination of Elo rating, FIFA ranking, ESPN Soccer Power Index, Home advantage, Players’ Age, Transfer values, etc.). This is achieved by means of a Monte Carlo simulation.

I was looking at this system’s predictions for the Brazil-Colombia quarter final, and comparing that with odds on Betfair (perhaps the most liquid betting site). Based purely on Elo rating, Brazil has a 77% chance of progress. Adding home advantage increases the probability to 80%. The ESPN SPI is not so charitable to Brazil, though – it gives Brazil a 65% chance of progress, which increases to 71% when home advantage is factored in.

Assuming that home advantage is something that cannot be ignored (though the extent of it is questionable for games played at non-traditional venues such as Fortaleza or Manaus), we will take the with home advantage numbers – that gives a 70-80% chance of Brazil getting past Colombia.

So what does Betfair say? As things stand now, a Brazil win is trading at 1.85, which translates to a 54% chance of a Brazil victory.  A draw is trading at 3.8, which translates to a 26% chance. Assuming that teams are equally matched in case of a penalty shootout, this gives Brazil a 67% chance of qualification – which is below the range that is expected based on the SPI and Elo ratings. This discount, I hypothesize, is due to the commercial interest in Brazil’s World Cup performance.
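The conversion from Betfair prices to probabilities used above is just a reciprocal. A quick sketch of the calculation, ignoring any overround in the market and assuming, as above, that a draw after extra time is a coin toss (the function name is mine):

```python
def implied_probability(decimal_odds: float) -> float:
    """Probability implied by decimal odds, ignoring any market overround."""
    return 1.0 / decimal_odds


p_brazil_win = implied_probability(1.85)  # about 0.54
p_draw = implied_probability(3.8)         # about 0.26

# If the game is drawn, assume each side is equally likely to go through.
p_brazil_progress = p_brazil_win + 0.5 * p_draw
print(round(p_brazil_progress, 2))  # 0.67
```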

Given that a large number of entities stand to gain from Brazil’s continued progress in the World Cup, they would want to protect their interest by hedging their bets – or by betting against Brazil. While there might be some commercial interest in betting against Colombia (by the Colombian World Cup broadcaster, perhaps?) this interest would be lower than that of the Brazil interest. As a result, the volume of “hedges” by entities with an exposure to Brazil is likely to pull down the “price” of a Brazil win – in other words, it will lead to undervaluation (in the betting market) of the probability that Brazil will win.

So how can you bet on it? There is no easy answer – since the force is acting only one way, there is no real arbitrage opportunity (all betting exchanges are likely to have same prices). The only “trade” here is to go long Brazil – since the “real probability” of progress is probably higher than what is implied by the betting markets. But then you need to know that this is a directional bet contingent upon Brazil’s victory, and need to be careful!

What’s in a shirt number?

There is a traditional way of allotting shirt numbers to football players. “Back to front, right to left”, goes the rule. The goalkeeper is thus number 1. Irrespective of the system used, the right back is number 2, and usually the left forward/winger is number 11.

Now, the way different teams allot numbers depends upon their historical formations, and how their current formations have evolved from those historical formations. The two historical formations are the 2-3-5 (mostly played in Europe) and the W-M (which originated in South America).

You can read Jonathan Wilson’s excellent Inverting the Pyramid to find more about how formations evolved. This post, however, is about shirt numbers in the ongoing world cup.

Now, given the way numbering has evolved in different countries, each number (between 1 and 11) has a traditional set of roles involved. 1 is the goalie everywhere, 2 is right back everywhere. 3 is left back in Europe, but right centre back in South America. 4 is central midfielder in England, but centre back in Spain and South America. 6 is a central defender in England, left back in Brazil/Argentina and a midfielder in Spain.

These are essentially numbering conventions based on how numbering systems have evolved, but are seldom a rule. However, such conventions are so ingrained in the traditional football watcher’s mind that when a player wears a shirt number that is not normally associated with his position, it appears “wrong”.

For example, William Gallas, a centre back (and occasional right back for Chelsea) by trade, moved to Arsenal in 2006 and promptly got the number 10 shirt, which is usually reserved for a central attacker/attacking midfielder (in fact, the number now defines the role – it is simply called the “number 10 role”). In the last season, West Ham used two successive left backs (Razvan Rat and Pablo Armero), and both were allotted number 8 – traditionally allocated to a central midfielder.

In this post, we will look at the squads of the ongoing world cup and try and understand how many players are wearing “wrong” shirt numbers. In order to do this, we look at the most common roles associated with a particular number, and identify any players that don’t fit this convention.

Figures 1 and 2 have the summary of the distribution of roles according to shirt number.




As we can see, all number 1s are goalkeepers (perhaps there is a FIFA rule to this effect). Most number 2s and 3s are defenders, but there is the odd midfielder and forward also who wears this. Iranian forward Khosro Heydari wears 2, as do Greek midfielder Ioannis Maniatis and Bosnian midfielder Avdija Vrsajevic.

The most unnatural number 3 (in his defence, he’s always worn 3) is Ghanaian striker Asamoah Gyan. Iranian midfielder Ehsan Hajsafi also wears a 3, contrary to convention.

As discussed earlier, midfielders from a few countries wear 4, but there are also two forwards who wear that number – Japanese Keisuke Honda and Australian Tim Cahill. This can be explained by the fact that both of them started off as midfielders, and then turned into forwards, but perhaps wanted to keep their original numbers.

5 is split entirely between defenders and midfielders, who also make up for most of the number 6s. The one exception to this is Russia’s Maksim Kanunnikov, who is a forward. Interestingly, as many as six number 7s (associated with a right winger in both 2-3-5 and W-M systems) are listed as defenders! This includes Colombia’s left back Armero who notoriously wore 8 for West Ham last year. This might possibly be explained by players who started off as wingers and then moved back, but kept their numbers. Two defenders – Costa Rica’s Heiner Mora and Australia’s Bailey Wright wear number 8.

Number 9 is again one of those numbers which is associated with a specific role – a centre forward. In fact, in recent times, there is a variation of this called the “false nine” (there is also a “false ten” now). We would thus expect that all players wearing number 9 are centre forwards, but a few midfielders also get that number. Prominent among those is Newcastle’s Cheick Tiote, who wears 9 for Cote D’Ivoire.

10 is split between midfielders and forwards (as expected), but a few defenders wear 11. Croatian captain and right back Darijo Srna wears 11, as also does Greek defender Loukas Vyntra.

Beyond 11, there is no real convention in terms of shirt numbering. The only interesting thing is in the numbers allotted to the reserve goalkeepers (notice that no goalies take any number between 2 and 11). By far, 12 is the most popular number allotted to the reserve goalkeeper, but some teams use 13 as well. Then, 22 and 23 are also pretty popular numbers for goalkeepers.

Finally, we saw that Iran was the culprit in allocating numbers 2 and 3 to non-defenders. Greece, too, came up as a repeat offender in terms of allocating inappropriate numbers. Can we build a “number convention index” and see which countries deviated most from the numbering conventions?

Now, there are degrees in being unconventional, and these need to be accommodated into the analysis. For example, a midfielder wearing 4 (there are 6 of them) is pretty normal, but a forward wearing 3 is simply plain wrong. A forward wearing 8 is not “correct”, but not “wrong” either – this shows that we need more than a simple binary scoring system.

What we will do is to first identify the most common player type for each number, and every such player will get a score of 1. For every other player wearing that number, the score will be the number of such players wearing that number divided by the number of players wearing that number who occupy the most popular position for that number.

I’m assuming the last paragraph didn’t make sense so let me use an example. To use number 2, the most popular position for a number 2 is in defence, so every defender who wears 2 gets 1 point. There are two midfielders who wear 2, compared to 29 defenders who wear 2. Thus, each midfielder who wears 2 gets 2/29 points. One forward wears 2, and he gets 1/29 point.

Taking number 10, the most common position for the number is forward (there are 17 of them), and they all get 1 point. The remaining 15 players who wear 10 are all midfielders, and they get 15/17 points (notice this is not so much less than 1).

This way, each member of each squad gets allotted points based on how “normal” his shirt number is given his position. Summing up the points across players of a team, we get a team score on how “natural” the shirt numbers are. The maximum score a team can get is 23 (each player wearing a number appropriate for his position).
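In code, the scoring scheme described above comes down to one division per position. A minimal sketch, using the counts quoted above for numbers 2 and 10 (the function names and the two-player toy squad are mine, purely for illustration):

```python
from collections import Counter


def position_scores(positions_for_number):
    """Score per position for one shirt number: the count of players in that
    position divided by the count in the most common position."""
    counts = Counter(positions_for_number)
    top = max(counts.values())
    return {position: count / top for position, count in counts.items()}


# Counts quoted in the text for numbers 2 and 10.
number_2 = ["Defender"] * 29 + ["Midfielder"] * 2 + ["Forward"]
number_10 = ["Forward"] * 17 + ["Midfielder"] * 15
scores = {2: position_scores(number_2), 10: position_scores(number_10)}


def team_score(squad, scores):
    """Sum of player scores; squad is a list of (shirt number, position)."""
    return sum(scores[number][position] for number, position in squad)


# Toy squad of two players: a defender in 2 and a midfielder in 10.
print(team_score([(2, "Defender"), (10, "Midfielder")], scores))
```

The most common position automatically scores 1, since its count is divided by itself, so no special case is needed.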

Table 3 here has the team-wise information on correctness of shirt numbers. The team with the worst allocated shirt numbers happens to be Nigeria with 16.13. At the other end, the team that has allocated numbers most appropriately is Ecuador, with 21.01.

Country  Score
Nigeria         16.13
Costa Rica         16.85
Greece         17.22
Australia         17.45
Iran         17.69
Ivory Coast         17.74
Colombia         18.19
Cameroon         18.19
Argentina         18.29
USA         18.41
Italy         18.63
Portugal         18.66
Algeria         18.68
Honduras         18.69
France         19.02
Ghana         19.06
Croatia         19.08
Netherlands         19.20
Mexico         19.23
Brazil         19.28
Chile         19.37
Japan         19.52
South Korea         19.69
England         19.77
Russia         20.03
Switzerland         20.04
Uruguay         20.25
Spain         20.27
Bosnia & Herzegovina         20.37
Belgium         20.83
Germany         20.86
Ecuador         21.01


This, however, may not tell the complete story. As we saw earlier, conventions regarding numbers between 12 and 23 are not as strict, and thus these numbers can get allocated in a more random fashion compared to 1-11. There are absolutely no taboos related to numbers 12-23, and thus, misallocating them is less of a crime than misallocating 1-11.

Hence, we will look at the numbers 1 to 11, and see how teams have performed. Table 4 has this information:

Country  Score
Australia           7.45
USA           8.04
Iran           8.18
Greece           8.59
Nigeria           8.63
Ivory Coast           8.71
Costa Rica           8.84
Ghana           8.95
Croatia           8.96
Japan           9.06
Colombia           9.27
Brazil           9.29
Spain           9.66
Bosnia & Herzegovina           9.67
Uruguay           9.70
Honduras           9.71
Portugal           9.78
Italy           9.81
South Korea           9.83
Cameroon           9.83
Chile           9.87
Russia           9.94
England           9.97
Argentina         10.03
Switzerland         10.18
Netherlands         10.24
Algeria         10.30
Ecuador         10.34
Mexico         10.35
France         10.53
Belgium         10.88
Germany         11.00


Germany has a “perfect” first eleven, in terms of number allocation. Belgium comes close. At the other end of the scale, we have Australia, which seems to have the most misallocated 1-11 shirt numbers. Iran and Greece, which we anecdotally saw as having high misallocations, are third and fourth, with the United States second.

Note: The data is taken from the Guardian Data Blog. Now, this analysis should be taken with some salt since in the modern game, the division of players into “defender”, “midfielder” and “forward” is not straightforward. Where would you put a “classic number ten”? What about a wing back? And so forth.

Money can buy me Premier League performance

The following graph plots the premier league performance (in terms of points) for the 2012-13 season as a function of the team’s wage bill. Apart from a few outliers here and there the correlation is astounding:



The red line is the line of best fit (according to a linear regression) and comparing team standings with respect to the line shows how well teams performed relative to what their wages would predict.

It is interesting to see that Manchester City almost fall off the charts in terms of wages, yet they could not translate this to on-pitch performance. It can also be seen that Manchester United, Spurs and Everton significantly over-performed given their wage bills.

Based on the wage bill, it would have also been reasonably easy to predict that Wigan Athletic and Reading would get relegated at the end of the season, though it must be mentioned that they underperformed their wage bills. QPR, however, should have done a lot better given the size of their pay packet.

A simple linear regression of points against wage bill shows that every GBP 4 million increase in the wage bill is associated with one additional point in the premier league! And the regression has an R-square of 69% – which means that the team’s wage bill explains 69% of the variation in the team’s performance! That is a remarkably strong relationship for a single variable.

The screenshot of the regression is given below:


Note that in this post we only use the wage bill and not any transfer fees paid. However, the assumption is that the two are reasonably correlated and we are not losing out on any information by using only the wage bill.



Classifying cricket grounds

For some work I’m trying to classify cricket grounds. The question is if we can classify cricket grounds based on what kind of cricket they support. Some pitches are slow and low – it is hard to score runs, but also hard to get the batsman out. Some others are fast and bouncy – easy to score and easy to get out. Then you have the “batting pitches” – easy to score and hard to get out and “bowling pitches” – hard to score but easy to take wickets.

Essentially I’m trying to see if I can classify a ground into one of the above four regimes (or a superposition of them) at different stages in a game – this will help estimate how the rest of the game is going to play out.

For this, I was looking at the runs per ball and balls per wicket statistic for a number of grounds based on T20 matches. All grounds which hosted over 10 T20 matches (international or IPL) before the 10th of April have been considered for this analysis. It is interesting, to say the least.

Here is the scatter plot – bottom right (only the Oval) is easy to score, easy to get out. Top right are the batting pitches, bottom left the bowling pitches and top left the slow-and-low! It is interesting that the “most bowling pitch” of the lot is Chittagong! The only Indian ground that can be classified thus is DY Patil Sports Academy in Navi Mumbai!
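The quadrant reading of the scatter plot can be written down as a simple rule. This is only a sketch: the cut-off values are placeholders I have picked for illustration (in practice they could be the medians across grounds), and the example inputs are made up rather than taken from the data.

```python
def classify_ground(runs_per_ball, balls_per_wicket, rpb_cut, bpw_cut):
    """Place a ground in one of the four regimes, given cut-offs on each axis."""
    easy_to_score = runs_per_ball >= rpb_cut
    hard_to_get_out = balls_per_wicket >= bpw_cut
    if easy_to_score and hard_to_get_out:
        return "batting"          # top right
    if easy_to_score:
        return "fast and bouncy"  # bottom right
    if hard_to_get_out:
        return "slow and low"     # top left
    return "bowling"              # bottom left


# Placeholder cut-offs and made-up grounds, purely for illustration.
RPB_CUT, BPW_CUT = 1.3, 18.0
print(classify_ground(1.40, 16.0, RPB_CUT, BPW_CUT))  # fast and bouncy
print(classify_ground(1.15, 15.0, RPB_CUT, BPW_CUT))  # bowling
```

A hard rule like this gives a single label per ground. For the superposition of regimes mentioned earlier, the distance from each cut-off could be turned into weights instead.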