## Simulating Covid-19 Scenarios

I must warn that this is a super long post. Also I wonder if I should put this on medium in order to get more footage.

Most models of disease spread use what is known as a “SIR” framework. This Numberphile video gives a good primer into this framework.

The problem with the framework is that it’s too simplistic. It depends primarily on one parameter “R0”, which is the average number of people that each infected patient infects. When R0 is high, each patient infects many other people, and the disease spreads fast. With a low R0, the disease spreads slowly. It was the SIR model that was used to produce all those “flatten the curve” pictures that we were bombarded with a week or two back.

There is a second parameter as well – the recovery or removal rate. Some diseases are so lethal that they have a high removal rate (eg. Ebola), and this puts a natural limit on how much the disease can spread, since infected people die before they can infect too many people.
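For readers who prefer code to video, here is a minimal discrete-time sketch of the SIR idea described above. The function name `simulate_sir`, the daily time step, the population size and the parameter values are all illustrative assumptions, not estimates for any real disease. The only relationship used is the textbook one: the infection rate equals R0 times the removal rate.

```python
def simulate_sir(r0, removal_rate, days=365, population=1_000_000, initial_infected=10):
    """Discrete-time (daily step) SIR sketch. Returns the infected count for each day."""
    beta = r0 * removal_rate  # infection rate implied by R0 and the removal rate
    s = float(population - initial_infected)  # susceptible
    i = float(initial_infected)               # infected
    infected_by_day = [i]
    for _ in range(days):
        new_infections = min(s, beta * s * i / population)
        new_removals = removal_rate * i  # recoveries and deaths both leave the infected pool
        s -= new_infections
        i += new_infections - new_removals
        infected_by_day.append(i)
    return infected_by_day


if __name__ == "__main__":
    # Illustrative values only: a removal rate of 0.1 per day and two made-up R0 values.
    for r0 in (1.5, 2.5):
        curve = simulate_sir(r0, removal_rate=0.1)
        peak = max(curve)
        print(f"R0={r0}: peak infected = {peak:,.0f} on day {curve.index(peak)}")
```

Running it with a lower R0 gives a lower and later peak, which is the “flatten the curve” picture in numbers.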

In any case, such modelling is great for academic studies, and post-facto analyses where R0 can be estimated. As we are currently in the middle of an epidemic, this kind of simplistic modelling can’t take us far. Nobody has a clue yet on what the R0 for covid-19 is. Nobody knows what proportion of total cases are asymptomatic. Nobody knows the mortality rate.

And things are changing well-at-a-faster-rate. Governments are imposing distancing of various forms. First offices were shut down. Then shops were shut down. Now everything is shut down, and many of us have been asked to step out “only to get necessities”. And in such dynamic and fast-changing environments, a simplistic model such as the SIR can only take us so far, and uncertainty in estimating R0 means it can be pretty much useless as well.

In this context, I thought I’d simulate a few real-life situations, and try to model the spread of the disease in these situations. This can give us an insight into what kind of services are more dangerous than others, and how we could potentially “get back to life” after going through an initial period of lockdown.

The basic assumption I’ve made is that the longer you spend with an infected person, the greater the chance of getting infected yourself. This is not an unreasonable assumption because the spread happens through activities such as sneezing, touching, inadvertently dropping droplets of your saliva on to the other person, and so on, each of which is more likely the longer the time you spend with someone.

Some basic modelling revealed that this can be modelled as a sort of negative exponential curve that looks like this.

$p = 1 - e^{-\lambda T}$

T is the number of hours you spend with the other person. $\lambda$ is a parameter of transmission – the higher it is, the more likely the disease will transmit (holding the amount of time spent together constant).

The function looks like this:

We have no clue what $\lambda$ is, but I’ll make an educated guess based on some limited data I’ve seen. I’ll take a conservative estimate and say that if an uninfected person spends 24 hours with an infected person, the former has a 50% chance of getting the disease from the latter.

This gives the value of $\lambda$ to be 0.02888 per hour. We will now use this to model various scenarios.
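The value of $\lambda$ follows from solving $0.5 = 1 - e^{-24\lambda}$, which gives $\lambda = \ln(2)/24$. A small sketch of that arithmetic (the names `lam` and `transmission_probability` are just illustrative):

```python
import math

# Assumption from the post: 24 hours with an infected person gives a 50% chance of infection.
HOURS = 24
P_INFECTED = 0.5

# Solve p = 1 - exp(-lambda * T) for lambda.
lam = -math.log(1 - P_INFECTED) / HOURS


def transmission_probability(hours, lam=lam):
    """Probability of infection after `hours` spent with one infected person."""
    return 1 - math.exp(-lam * hours)


print(round(lam, 5))  # 0.02888
```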

#### 1. Delivery

This is the simplest model I built. There is one shop, and N customers.  Customers come one at a time and spend a fixed amount of time (1 or 2 or 5 minutes) at the shop, which has one shopkeeper. Initially, a proportion $p$ of the population is infected, and we assume that the shopkeeper is uninfected.

And then we model the transmission – based on our $\lambda = 0.02888$, for a two minute interaction, the probability of transmission is $1 - e^{-\lambda T} = 1 - e^{-\frac{0.02888 \times 2}{60}} \approx 0.1\%$.

In hindsight, I realised that this kind of a set up better describes “delivery” than a shop. With a 0.1% probability the delivery person gets infected from an infected customer during a delivery. With the same probability an infected delivery person infects a customer. The only way the disease can spread through this “shop” is for the shopkeeper / delivery person to get infected first.

How does it play out? I simulated 10000 paths where one guy delivers to 1000 homes (maybe over the course of a week? that doesn’t matter as long as the overall infected rate in the population otherwise is constant), and spends exactly two minutes at each delivery, which is made to a single person. Let’s take a few cases, with different base cases of incidence of the disease – 0.1%, 0.2%, 0.5%, 1%, 2%, 5%, 10%, 20% and 50%.

The number of NEW people infected in each case is graphed here (we don’t care how many got the disease otherwise. We’re modelling how many got it from our “shop”). The  right side graph excludes the case of zero new infections, just to show you the scale of the problem.

Notice this – even when 50% of the population is infected, as long as the shopkeeper or delivery person is not initially infected, the chances of additional infections through 2-minute delivery are MINUSCULE. A strong case for policy-makers to enable delivery of all kinds, essential or inessential.
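A minimal sketch of how the delivery scenario could be simulated is below. It is not the post's actual code: the function names are made up, the delivery person is counted as a new infection if they catch it (an assumption), and the default number of paths is lower than the 10000 used in the post so that it runs quickly.

```python
import math
import random

LAM = math.log(2) / 24                    # per-hour transmission parameter from the post
P_CONTACT = 1 - math.exp(-LAM * 2 / 60)   # one two-minute delivery, about 0.1%


def simulate_delivery_round(base_rate, n_customers=1000, rng=random):
    """One path: an initially uninfected delivery person visits n_customers homes.

    Returns the number of NEW infections caused by the deliveries
    (the delivery person counts as one if they catch it).
    """
    courier_infected = False
    new_infections = 0
    for _ in range(n_customers):
        customer_infected = rng.random() < base_rate
        if customer_infected and not courier_infected:
            # An infected customer may infect the delivery person.
            if rng.random() < P_CONTACT:
                courier_infected = True
                new_infections += 1
        elif courier_infected and not customer_infected:
            # An infected delivery person may infect a healthy customer.
            if rng.random() < P_CONTACT:
                new_infections += 1
    return new_infections


def share_with_no_new_infections(base_rate, n_paths=1000, seed=0):
    """Fraction of simulated paths in which the deliveries cause no new infection."""
    rng = random.Random(seed)
    zero = sum(simulate_delivery_round(base_rate, rng=rng) == 0 for _ in range(n_paths))
    return zero / n_paths


if __name__ == "__main__":
    for rate in (0.001, 0.01, 0.1, 0.5):
        print(rate, share_with_no_new_infections(rate))
```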

#### 2. Shop

Now, let’s complicate matters a little bit. Instead of a delivery person going to each home, let’s assume a shop. Multiple people can be in the shop at the same time, and there can be more than one shopkeeper.

Let’s use the assumptions of standard queueing theory, and assume that the inter-arrival time for customers is guided by an Exponential distribution, and the time they spend in the shop is also guided by an Exponential distribution.

At the time when customers are in the shop, any infected customer (or shopkeeper) inside can infect any other customer or shopkeeper. So if you spend 2 minutes in a shop where there is 1 infected person, our calculation above tells us that you have a 0.1% chance of being infected yourself. If there are 10 infected people in the shop and you spend 2 minutes there, this is akin to spending 20 minutes with one infected person, and you have a 1% chance of getting infected.
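The “10 infected people for 2 minutes is like 1 infected person for 20 minutes” reasoning can be written as a one-line formula. The sketch below assumes exposure simply adds up across infected people; the function name is illustrative.

```python
import math

LAM = math.log(2) / 24  # per hour, from the 50%-in-24-hours assumption


def infection_probability(minutes_in_shop, infected_present, lam=LAM):
    """Chance of catching the disease when sharing the shop with
    `infected_present` infected people for `minutes_in_shop` minutes."""
    effective_hours = minutes_in_shop * infected_present / 60
    return 1 - math.exp(-lam * effective_hours)


print(f"{infection_probability(2, 1):.2%}")   # about 0.10%
print(f"{infection_probability(2, 10):.2%}")  # about 0.96%, i.e. roughly 1%
```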

Let’s consider two or three scenarios here. First is the “normal” case where one customer arrives every 5 minutes, and each customer spends 10 minutes in the shop (note that the shop can “serve” multiple customers simultaneously, so the queue doesn’t blow up here). Again let’s take a total of 1000 customers (assume a 24/7 open shop), and one shopkeeper.

Notice that there is significant transmission of infection here, even though we started with 5% of the population being infected. On average, another 3% of the population gets infected! Open supermarkets with usual crowd can result in significant transmission.

Does keeping the shop open with some sort of social distancing (let’s say only one-fourth as many people arrive) work? So people arrive with an average gap of 20 minutes, and still spend 10 minutes in the shop. There are still 10 shopkeepers. What does it look like when we start with 5% of the people being infected?

The graph is pretty much identical so I’m not bothering to put that here!

#### 3. Office

This scenario simulates N people working together for a certain number of hours. We assume that exactly one person is infected at the beginning of the meeting. We also assume that once a person is infected, she can start infecting others in the very next minute (with our transmission probability).

How does the infection grow in this case? This is an easier simulation than the earlier one so we can run 10000 Monte Carlo paths. Let’s say we have a “meeting” with 40 people (could just be 40 people working in a small room) which lasts 4 hours. If we start with one infected person, this is how the number of infected grows over the 4 hours.

The spread is massive! When you have a large bunch of people in a small closed space over a significant period of time, the infection spreads rapidly among them. Even if you take a 10 person meeting over an hour, one infected person at the start can result in an average of 0.3 other people being infected by the end of the meeting.

10 persons meeting over 8 hours (a small office) with one initially infected means 3.5 others (on average) being infected by the end of the day.
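A minute-by-minute sketch of the office scenario is below. It is an assumed reconstruction of the set-up described above, not the post's own code: everyone in the room is exposed to every infected person each minute, and someone infected in one minute can infect others from the next. The function names and the default of 2000 paths are illustrative, so the averages it prints need not match the figures quoted above exactly.

```python
import math
import random

LAM = math.log(2) / 24
P_MINUTE = 1 - math.exp(-LAM / 60)  # chance one infected person infects one other in a minute


def simulate_office(n_people, hours, rng=random):
    """One path: returns how many people OTHER than the initial case end up infected."""
    infected = 1
    for _ in range(int(hours * 60)):
        susceptible = n_people - infected
        if susceptible == 0:
            break
        # A susceptible person is infected unless they escape every infected colleague.
        p_catch = 1 - (1 - P_MINUTE) ** infected
        newly = sum(rng.random() < p_catch for _ in range(susceptible))
        infected += newly  # the newly infected start infecting from the next minute
    return infected - 1


def average_others_infected(n_people, hours, n_paths=2000, seed=1):
    """Monte Carlo average of simulate_office over n_paths paths."""
    rng = random.Random(seed)
    return sum(simulate_office(n_people, hours, rng) for _ in range(n_paths)) / n_paths


if __name__ == "__main__":
    print(average_others_infected(10, 1))
    print(average_others_infected(10, 8))
```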

Offices are dangerous places for the infection to spread. Even after the lockdown is lifted, some sort of work from home regulations need to be in place until the infection has been fully brought under control.

#### 4. Conferences

This is another form of “meeting”, except that at each point in time, people don’t engage with the whole room, but only a handful of others. These groups form at random, changing every minute, and infection can spread only within a particular group.

Let’s take a 100 person conference with 1 initially infected person. Let’s assume it lasts 8 hours. Depending upon how many people come together at a time, the spread of the infection rapidly changes, as can be seen in the graph below.

If people talk two at a time, there’s a 63% probability that the infection doesn’t spread at all. If they talk 5 at a time, this probability is cut by half. And if people congregate 10 at a time, there’s only an 11% chance that by the end of the day the infection HASN’T propagated!

One takeaway from this is that even once offices start functioning, they need to impose social distancing measures (until the virus has been completely wiped out). All large-ish meetings by video conference. A certain proportion of workers working from home by rotation.

And I wonder what will happen to the conferences.

I’ve put my (unedited) code here. Feel free to use and play around.

Finally, you might wonder why I’ve made so many Monte Carlo Simulations. Well, as the great Matt Levine had himself said, that’s my secret sauce!

## Distribution of political values

Through Baal on Twitter I found this “Political Compass” survey. I took it, and it said this is my “political compass”.

Now, I’m not happy with the result. I mean, I’m okay with the average value where the red dot has been put for me, and I think that represents my political leanings rather well. However, what I’m unhappy about is that my political views have been all reduced to one single average point.

I’m pretty sure that based on all the answers I gave in the survey, my political leaning along both axes follows a distribution, and the red dot here is only the average (mean, I guess, but could also be median) value of that distribution.

However, there are many ways in which people can have a political view that lands right on my dot – some people might have a consistent but mild political view in favour of or against a particular position. Others might have pretty extreme views – for example, some of my answers might lead you to believe that I’m an extreme right winger, and others might make me look like a Marxist (I believe I have a pretty high variance on both axes around my average value).

So what I would have liked instead from the political compass was a sort of heat map, or at least two marginal distributions, showing how I’m distributed along the two axes, rather than all my views being reduced to one average value.

A version of this is the main argument of this book I read recently called “The End Of Average”. That when we design for “the average man” or “the average customer”, and do so across several dimensions, we end up designing for nobody, since nobody is average when looked at on many dimensions.

## Statistical analysis revisited – machine learning edition

Over ten years ago, I wrote this blog post that I had termed as a “lazy post” – it was an email that I’d written to a mailing list, which I’d then copied onto the blog. It was triggered by someone on the group making an off-hand comment of “doing regression analysis”, and I had set off on a rant about why the misuse of statistics was a massive problem.

Ten years on, I find the post to be quite relevant, except that instead of “statistics”, you just need to say “machine learning” or “data science”. So this is a truly lazy post, where I piggyback on my old post, to talk about the problems with indiscriminate use of data and models.

> there is this popular view that if there is data, then one ought to do statistical analysis, and draw conclusions from that, and make decisions based on these conclusions. unfortunately, in a large number of cases, the analysis ends up being done by someone who is not very proficient with statistics and who is basically applying formulae rather than using a concept. as long as you are using statistics as concepts, and not as formulae, I think you are fine. but you get into the “ok i see a time series here. let me put regression. never mind the significance levels or stationarity or any other such blah blah but i’ll take decisions based on my regression” then you are likely to get into trouble.

The modern version of this is – everybody wants to do “big data” and “data science”. So if there is some data out there, people will want to draw insights from it. And since it is easy to apply machine learning models (thanks to open source toolkits such as the scikit-learn package in Python), people who don’t understand the models indiscriminately apply them to the data that they have got. So you have people who don’t really understand data or machine learning working with both, and creating models that are dangerous.

As long as people have an idea of the models they are using, the assumptions behind them, and the quality of the data that goes into them, we are fine. However, we are increasingly seeing cases of people taking improper or biased data and applying models they don’t understand on top of it, with consequences that affect the wider world.

So the problem is not with “artificial intelligence” or “machine learning” or “big data” or “data science” or “statistics”. It is with the people who use them incorrectly.

## Big Data and Fast Frugal Trees

In his excellent podcast episode with EconTalk’s Russ Roberts, psychologist Gerd Gigerenzer introduces the concept of “fast and frugal trees”. When someone needs to make decisions quickly, Gigerenzer says, they don’t take into account a large number of factors, but instead rely on a small set of thumb rules.

The podcast itself is based on Gigerenzer’s 2009 book Gut Feelings. Based on how awesome the podcast was, I read the book, but found that it didn’t offer too much more than what the podcast itself had to offer.

Coming back to fast and frugal trees..

In recent times, ever since “big data” became a “thing” in the early 2010s, it is popular for companies to tout the complexity of their decision algorithms, and machine learning systems. An easy way for companies to display this complexity is to talk about the number of variables they take into account while making a decision.

For example, you can have “fin-tech” lenders who claim to use “thousands of data points” on their prospective customers’ histories to determine whether to give out a loan. A similar number of data points is used to evaluate resumes and determine if a candidate should be called for an interview.

With cheap data storage and compute power, it has become rather fashionable to “use all the data available” and build complex machine learning models (which aren’t that complex to build) for decisions that were earlier made by humans. The problem with this is that this can sometimes result in over-fitting (system learning something that it shouldn’t be learning) which can lead to disastrous predictive power.

In his podcast, Gigerenzer talks about fast and frugal trees, and says that humans in general don’t use too many data points to make their decisions. Instead, for each decision, they build a quick “fast and frugal tree” and make their decision based on their gut feelings about a small number of data points. What data points to use is determined primarily based on their experience (not cow-like experience), and can vary by person and situation.
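To make the idea concrete, here is a toy fast and frugal tree for the loan example mentioned earlier. The cues, their order and the thresholds are entirely made up for illustration; the point is only the shape: look at one cue at a time, and exit as soon as a cue is decisive.

```python
def fast_frugal_loan_decision(applicant):
    """Toy fast-and-frugal tree. Every cue and threshold here is hypothetical."""
    # Cue 1: a past default is decisive on its own.
    if applicant["has_defaulted_before"]:
        return "reject"
    # Cue 2: a stable job is enough to approve without looking further.
    if applicant["years_in_current_job"] >= 3:
        return "approve"
    # Cue 3: the last cue decides either way.
    if applicant["loan_to_income"] <= 0.3:
        return "approve"
    return "reject"


print(fast_frugal_loan_decision(
    {"has_defaulted_before": False, "years_in_current_job": 1, "loan_to_income": 0.2}
))  # approve
```

Three data points, checked in order, instead of “thousands”.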

The advantage of fast and frugal trees is that the model is simple, and so has little scope for overfitting. Moreover, as the name describes, the decision process is rather “fast”, and you don’t have to collect all possible data points before you make a decision. The problem with productionising the fast and frugal tree, however, is that each user’s decision-making process is different, so we need to figure out how to learn that process in order to make the best decisions at a personalised level.

How you can learn someone’s decision-making process (when you’ve assumed it’s a fast and frugal tree) is not trivial, but if you can figure it out, then you can build significantly superior recommender systems.

If you’re Netflix, for example, you might figure that someone makes their movie choices based only on age of movie and its IMDB score. So their screen is customised to show just these two parameters. Someone else might be making their decisions based on who the lead actors are, and they need to be shown that information along with the recommendations.

Another book I read recently was Todd Rose’s The End of Average. The book makes the powerful point that nobody really is average, especially when you’re looking at a large number of dimensions, so designing for average means you’re designing for nobody.

I imagine that one reason why a lot of recommender systems (Netflix or Amazon or Tinder) fail is that they model for the average, building one massive machine learning system, rather than learning each person’s fast and frugal tree.

The latter isn’t easy, but if it can be done, it can result in a significantly superior user experience!

## Liverpool FC: Mid Season Review

After 20 games played, Liverpool are sitting pretty on top of the Premier League with 58 points (out of a possible 60). The only jitter in the campaign so far came in a draw away at Manchester United.

I made what I think is a cool graph to put this performance in perspective. I looked at Liverpool’s points tally at the end of the first 19 match days through the length of the Premier League, and looked at “progress” (the data for last night’s win against Sheffield isn’t yet up on my dataset, which also doesn’t include data for the 1992-93 season, so those are left out).

Given the strength of this season’s performance, I don’t think there’s that much information in the graph, but here it goes in any case:

I’ve coloured all the seasons where Liverpool were the title contenders. A few things stand out:

1. This season, while great, isn’t that much better than the last one. Last season, Liverpool had three draws in the first half of the league (Man City at home, Chelsea away and Arsenal away). It was the first month of the second half where the campaign faltered (starting with the loss to Man City).
2. This possibly went under the radar, but Liverpool had a fantastic start to the 2016-17 season as well, with 43 points at the halfway stage. To put that in perspective, this was one more than the points total at that stage in the title-chasing 2008-9 season.
3. Liverpool went close in 2013-14, but in terms of points, the halfway performance wasn’t anything to write home about. That was also back in the time when teams didn’t dominate like nowadays, and eighty odd points was enough to win the league.

This is what Liverpool’s full season looked like (note that I’ve used a different kind of graph here. Not sure which one is better).

Finally, what’s the relationship between points at the end of the first half of the season (19 games) and the full season? Let’s run a regression across all teams, across all 38 game EPL seasons.

The fit doesn’t turn out to be THAT strong, with an R Squared of 41%. In other words, a team’s points tally at the halfway point in the season explains only about 41% of the variation in the points tally that the team will get in the second half of the season.

    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  9.42967    0.97671   9.655   <2e-16 ***
    Midway       0.64126    0.03549  18.070   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 6.992 on 478 degrees of freedom
      (20 observations deleted due to missingness)
    Multiple R-squared:  0.4059,    Adjusted R-squared:  0.4046
    F-statistic: 326.5 on 1 and 478 DF,  p-value: < 2.2e-16


The interesting thing is that the coefficient of the midway score is less than 1, which implies that teams’ performances at the end of the season (literally) regress to the mean.

55 points at the end of the first 19 games is projected to translate to 100 at the end of the season. In fact, based on this regression model run on the first 19 games of the season, Liverpool should win the title by a canter.
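The projection is just the fitted line applied to the halfway tally. The sketch below reads the regression as predicting second-half points from first-half points (which is what the numbers above imply), and uses the two coefficients from the output; the function name is illustrative.

```python
INTERCEPT = 9.42967  # from the regression output above
SLOPE = 0.64126      # coefficient on points after 19 games


def project_season_total(midway_points):
    """Halfway points plus the second-half points predicted by the fitted line."""
    second_half = INTERCEPT + SLOPE * midway_points
    return midway_points + second_half


print(round(project_season_total(55), 1))  # 99.7

# Break-even: the halfway tally at which the predicted second half equals the first.
# Teams above it are projected to slow down, teams below it to speed up.
break_even = INTERCEPT / (1 - SLOPE)
print(round(break_even, 1))  # 26.3
```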

PS: Look at the bottom of this projections table. It seems like for the first time in a very long time, the “magical” 40 points might be necessary to stave off relegation. Then again, it’s regression (pun intended).

## This year on Spotify

I’m rather disappointed with my end-of-year Spotify report this year. I mean, I know it’s automated analytics, and no human has really verified it, etc.  but there are some basics that the algorithm failed to cover.

The first few slides of my “annual report” told me that my listening changed by seasons. That in January to March, my favourite artists were Black Sabbath and Pink Floyd, and from April to June they were Becky Hill and Meduza. And that from July onwards it was Sigala.

Now, there was a life-changing event that happened in late March which Spotify knows about, but failed to acknowledge in the report – I moved from the UK to India. And in India, Spotify’s inventory is far smaller than it is in the UK. So some of the bands I used to listen to heavily in the UK, like Black Sabbath, went off my playlist in India. My daughter’s lullaby playlist, which is the most consumed music for me, moved from Spotify to Amazon Music (and more recently to Apple Music).

The other thing with my Spotify use-case is that it’s not just me who listens to it. I share the account with my wife and daughter, and while I know that Spotify has an algorithm for filtering out kid stuff, I’m surprised it didn’t figure out that two people are sharing this account (and pitched us a family subscription).

According to the report, these are the most listened to genres in 2019:

Now there are two clear classes of genres here. I’m surprised that Spotify failed to pick it out. Moreover, the devices associated with my account that play Rock or Power Metal are disjoint from the devices that play Pop, EDM or House. It’s almost like Spotify didn’t want to admit that people share accounts.

Then some three slides on my podcast listening for the year, when I’ve overall listened to five hours of podcasts using Spotify. If I, a human, were building this report, I would have dropped this section citing insufficient data, rather than wasting three slides with analytics that simply don’t make sense.

I see the importance of this segment in Spotify’s report, since they want to focus more on podcasts (being an “audio company” rather than a “music company”), but maybe something in the report to encourage me to use Spotify for more podcasts (maybe recommending Spotify’s exclusive podcasts that I might like, be it based on limited data?) might have helped.

Finally, take a look at ~~my~~ our most played songs in 2019.

It looks like my daughter’s sleeping playlist threaded with my wife’s favourite songs (after a point the latter dominate). “My songs” are nowhere to be found – I have to go all the way down to number 23 to find Judas Priest’s cover of Diamonds and Rust. I mean I know I’ve been diversifying the kind of music that I listen to, while my wife listens to pretty much the same stuff over and over again!

In any case, automated analytics is all fine, but there are some not-so-edge cases where the reports that it generates are obviously bad. Hopefully the people at Spotify will figure this out and use more intelligence in producing next year’s report!

## Spurs right to sack Pochettino?

A few months back, I built my “football club elo by manager” visualisation. Essentially, we take the week-by-week Premier League Elo ratings from ClubElo and overlay it with managerial tenures.

A clear pattern emerges – a lot of Premier League sackings have been consistent with clubs going down significantly in terms of Elo Ratings. For example, we have seen that Liverpool sacked Rafa Benitez, Kenny Dalglish (in 2012) and Brendan Rodgers all at the right time, and that similarly Manchester United sacked Jose Mourinho when he brought them back to below where he started.

And now the news comes in that Spurs have joined the party, sacking long-time coach Mauricio Pochettino. What I find interesting is the timing of the sacking – while international breaks are usually a popular time to change managers (the two week gap in fixtures gives a club some time to adjust), most sackings happen in the first week of the international break.

The Pochettino sacking is surprising in that it has come towards the end of the international break, giving the club four days before their next fixture (a derby at the struggling West Ham). However, the Guardian reports that Spurs are close to hiring Jose Mourinho, and that might explain the timing of the sacking.

So were Spurs right in sacking Pochettino, barely six months after he took them to a Champions League final? Let’s look at the Spurs story under Pochettino using Elo ratings.

Pochettino took over in 2014 after an underwhelming 2013-14 when the club struggled under Andre Villas Boas and then Tim Sherwood. Initially, results weren’t too promising, as he took them from a 1800 rating down to 1700.

However, chairman Daniel Levy’s patience paid off, and the club mounted a serious challenge to Leicester in the 2015-16 season before falling away towards the end of the season, finishing third behind Arsenal. As the Elo shows, the improvement continued, as the club remained in Champions League places through the course of Pochettino’s reign.

Personally, the “highlight” of Pochettino’s reign was Spurs’ 4-1 demolition of Liverpool at Wembley in October 2017, a game I happened to watch at the stadium. And as per the Elo ratings the club plateaued shortly after that.

If that plateau had continued,  I suppose Pochettino would have remained in his job, giving the team regular Champions League football. This season, however, has been a disaster.

Spurs are 13 points below what they had scored in comparable fixtures last season, and unlikely to finish in the top six even. Their Elo has also dropped below 1850 for the first time since 2016-17. While that is still higher than where Pochettino started off at, the precipitous drop in recent times has meant that the club has possibly taken the right call in sacking Pochettino.

If Mourinho does replace him (it looks likely, as per the Guardian), it will present a personal problem for me – for over a decade now, Tottenham have been my “second team” in the top half of the Premier League, behind Liverpool. That cannot continue if Mourinho takes over. I’m wondering who to shift my allegiance to – it will have to be either Leicester or (horror of horrors) Chelsea!

## Alchemy

Over the last 4-5 days I kinda immersed myself in finishing Rory Sutherland’s excellent book Alchemy.

It all started with a podcast, with Sutherland being the guest on Russ Roberts’ EconTalk last week. I’d barely listened to half the podcast when I knew that I wanted more of Sutherland, and so immediately bought the book on Kindle. The same evening, I finished my previous book and started reading this.

Sometimes I get a bit concerned that I’m agreeing with an author too much. What made this book “interesting” is that Sutherland is an ad-man and a marketer, and keeps talking down on data and economics, and plays up intuition and “feeling”. In other words, at least as far as professional career and leanings go, he is possibly as far from me as it gets. Yet, I found myself silently nodding in agreement as I went through the book.

If I have to summarise the book in one line I would say, “most decisions are made intuitively or based on feeling. Data and logic are mainly used to rationalise decisions rather than making them”.

And if you think about it, it’s mostly true. For example, you don’t use physics to calculate how much to press down on your car accelerator while driving – you do it essentially by trial and error and using your intuition to gauge the feedback. Similarly, a ball player doesn’t need to know any kinematics or projectile motion to know how to throw or hit or catch a ball.

The other thing that Sutherland repeatedly alludes to is that we tend to try and optimise things that are easy to measure or optimise. Financials are a good example of that. This decade, with the “big data revolution” being followed by the rise of “data science”, the amount of data available to make decisions has been endless, meaning that more and more decisions are being made using data.

The trouble, of course, is availability bias, or what I call the “keys-under-lamppost bias”. We tend to optimise and make decisions on things that are easily measurable (this set of course is now much larger than it was a decade ago), and now that we know we are making use of more objective stuff, we have irrational confidence in our decisions.

Sutherland talks about barbell strategies, ergodicity, why big data leads to bullshit, why it is important to look for solutions beyond the scope of the immediate domain and the Dunning-Kruger effect. He makes statements such as “I would rather run a business with no mathematicians than with second-rate mathematicians”, which exactly mirrors my opinion of the “data science industry”.

There is absolutely no doubt why I liked the book.

Thinking again, while I said that professionally Sutherland seems as far from me as possible, it’s possibly not so true. While I do use a fair bit of data and economic analysis as part of my consulting work, I find that I make most of my decisions finally on intuition. Data is there to guide me, but the decision-making is always an intuitive process.

In late 2017, when I briefly worked in an ill-fated job in “data science”, I’d made a document about the benefits of combining data analysis with human insight. And if I think about my work, my least favourite work is where I’ve done work with data to help clients make “logical decision” (as Sutherland puts it).

The work I’ve enjoyed the most has been where I’ve used the data and presented it in ways in which my clients and I have noticed patterns, rationalised them and then taken a (intuitive) leap of faith into what the right course of action may be.

And this also means that over time I’ve been moving away from work that involves building models (the output is too “precise” to interest me), and taking on more “strategic” stuff where there is a fair amount of intuition riding on top of the data.

Back to the book, I’m so impressed with it that if I were still living in London, I would have pestered Sutherland to meet me, and then tried to convince him to let me work for him. Even if at the top level it seems like his work and mine are diametrically opposite..

I leave you with my highlights and notes from the book, and this tweet.

Here’s my book, in case you are interested.

## EPL: Mid-Season Review

Going into the November international break, Liverpool are eight points ahead at the top of the Premier League. Defending champions Manchester City have slipped to fourth place following their loss to Liverpool. The question most commentators are asking is if Liverpool can hold on to this lead.

We are two-thirds of the way through the first round robin of the Premier League. The thing with evaluating league standings midway through the round robin is that it doesn’t account for the fixture list. For example, Liverpool have finished playing the rest of the “big six” (or seven, if you include Leicester), but Manchester City have many games to go among the top teams.

So my practice over the years has been to compare team performance to corresponding fixtures in the previous season, and to look at the points difference. Then, assuming the rest of the season goes just like last year, we can project who is likely to end up where.

Now, relegation and promotion introduce a complication, but we can “solve” that by replacing last season’s relegated teams with this season’s promoted teams (18th by Championship winners, 19th by Championship runners-up, and 20th by Championship playoff winners).
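
A minimal sketch of this comparison in Python, under stated assumptions: results are coded as `"H"`, `"D"` or `"A"` (home win, draw, away win), fixtures are keyed by `(home, away)`, and the team names and results are made up purely for illustration. The names `POINTS`, `points_differential` and `replacements` are hypothetical, not taken from the actual analysis.

```python
# Points earned by (home, away) for each result code. The "H"/"D"/"A"
# coding is an assumption made for this sketch.
POINTS = {"H": (3, 0), "D": (1, 1), "A": (0, 3)}


def points_differential(this_season, last_season, replacements):
    """Points gained or lost per team versus last season's corresponding fixtures.

    this_season, last_season: dicts mapping (home, away) -> "H", "D" or "A".
    replacements: promoted team -> relegated team whose fixtures it inherits.
    """
    diff = {}
    for (home, away), result in this_season.items():
        # Look up the same fixture last season, swapping in relegated teams.
        old_key = (replacements.get(home, home), replacements.get(away, away))
        if old_key not in last_season:
            continue  # no corresponding fixture to compare against
        new_home, new_away = POINTS[result]
        old_home, old_away = POINTS[last_season[old_key]]
        diff[home] = diff.get(home, 0) + new_home - old_home
        diff[away] = diff.get(away, 0) + new_away - old_away
    return diff


# Made-up results for illustration only: "P" was promoted in place of "R".
last_season = {("A", "B"): "H", ("B", "A"): "D", ("A", "R"): "H", ("R", "B"): "A"}
this_season = {("A", "B"): "D", ("A", "P"): "A", ("P", "B"): "D"}
diff = points_differential(this_season, last_season, {"P": "R"})
print(diff)
```

In this toy data, team "A" drew a home game it won last season and lost at home to the promoted side, so it ends up well in the negative, while "P" is compared against the relegated team it replaced.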

It’s not the first time I’m doing this analysis. I’d done it once in 2013-14, and once in 2014-15. You will notice that the graphs look similar as well – that’s how lazy I am.

Anyways, this is the points differential thus far compared to corresponding fixtures of last season.

Leicester are the most improved team from last season, having scored 8 points more than in corresponding fixtures from last season. Sheffield United, albeit starting from a low base, have also done extremely well. And last season’s runners-up Liverpool are on plus 6.

The team that has done worst relative to last season is Tottenham Hotspur, at minus 13. Key players entering the final years of their contract and not signing extensions, and scanty recruitment over the last 2-3 years, haven’t helped. And then there is Manchester City at minus 9!

So assuming the rest of the season’s fixtures go according to last season’s corresponding fixtures, what will the final table look like at the end of the season?

We see that if Liverpool replicate their results from last season for the rest of the fixtures, they should win the league comfortably.
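
The projection itself can be sketched roughly as follows, assuming the same hypothetical `"H"`/`"D"`/`"A"` result coding and `(home, away)` fixture keys; the teams, points and results below are invented for illustration. Each remaining fixture is simply awarded last season's result for the corresponding fixture.

```python
# Points earned by (home, away) for each result code (assumed coding).
POINTS = {"H": (3, 0), "D": (1, 1), "A": (0, 3)}


def project_final_points(current_points, remaining, last_season, replacements):
    """Add last season's corresponding results to the current points tally.

    current_points: team -> points so far this season.
    remaining: list of (home, away) fixtures still to be played.
    last_season: (home, away) -> "H", "D" or "A".
    replacements: promoted team -> relegated team whose fixtures it inherits.
    Raises KeyError if a remaining fixture has no counterpart last season.
    """
    projected = dict(current_points)
    for home, away in remaining:
        old_key = (replacements.get(home, home), replacements.get(away, away))
        home_pts, away_pts = POINTS[last_season[old_key]]
        projected[home] += home_pts
        projected[away] += away_pts
    return projected


# Made-up data for illustration only: "P" was promoted in place of "R".
last_season = {
    ("A", "B"): "H", ("B", "A"): "D", ("A", "R"): "H",
    ("R", "A"): "D", ("B", "R"): "H", ("R", "B"): "A",
}
current_points = {"A": 1, "B": 2, "P": 4}
remaining = [("B", "A"), ("P", "A"), ("B", "P")]

projected = project_final_points(current_points, remaining, last_season, {"P": "R"})
# Sort into a table, highest projected points first.
table = sorted(projected.items(), key=lambda item: item[1], reverse=True)
print(table)
```

The promoted side "P" inherits the relegated team's poor results for its remaining games, which is exactly why this kind of projection is harsh on promoted teams.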

What is more interesting is the gaps between 1-2, 2-3 and 3-4. Each of the top three positions is likely to be decided “comfortably”, with a fairly congested mid-table.

As mentioned earlier, this kind of analysis is unfair to the promoted teams. It is highly unlikely that Sheffield will get relegated based on the start they’ve had.

We’ll repeat this analysis after a couple of months to see where the league stands!

## Segmentation and machine learning

For best results, use machine learning to do customer segmentation, but then get humans with domain knowledge to validate the segments.

There are two common ways in which people do customer segmentation. The “traditional” method is to manually define the axes through which the customers will get segmented, and then simply look through the data to find the characteristics and size of each segment.

Then there is the “data science” way of doing it, which is to ignore all intuition, and simply use some method such as K-means clustering and “do gymnastics” with the data and find the clusters.

A quantitative extreme of this method is to do gymnastics with your data, get segments out of it, and quantitatively “take action” on it without really bothering to figure out what each cluster represents. Loosely speaking, this is how a lot of recommendation systems nowadays work – some algorithm somewhere finds people similar to you based on your behaviour, and recommends to you what they liked.

I usually prefer a sort of middle ground. I like to let the algorithms (k-means easily being my favourite) come up with the segments based on the data, and then have a bunch of humans look at the segments and make sense of them.

Basically whatever segments are thrown up by the algorithm need to be validated by human intuition. Getting counterintuitive clusters is also not a problem – on several occasions, the people I’ve validated the clusters with (usually clients) have used the counterintuitive clusters to discover bugs, gaps in the data or patterns that they didn’t know of earlier.

Also, it is always useful to get people with domain knowledge to validate the clusters. This means you need to be able to represent whatever clusters you’ve generated in a human-readable format. The best way of doing that is to use the cluster centres and then represent them somehow in a “physical” manner.
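
As a rough illustration of this workflow, here is a minimal plain-Python k-means sketch that turns the cluster centres into one readable line per segment. The customer data, the feature names and the helpers `nearest`, `kmeans` and `describe` are all hypothetical; in practice you would more likely use a library implementation and scale the features first.

```python
import random
from statistics import mean


def nearest(point, centres):
    """Index of the centre closest to point (squared Euclidean distance)."""
    return min(
        range(len(centres)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(point, centres[i])),
    )


def kmeans(points, k, iterations=50, seed=0):
    """Basic k-means: returns (centres, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centres)].append(p)
        # Move each centre to the mean of its cluster (keep it if empty).
        new_centres = [
            tuple(mean(dim) for dim in zip(*cluster)) if cluster else centres[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centres == centres:
            break  # converged
        centres = new_centres
    labels = [nearest(p, centres) for p in points]
    return centres, labels


def describe(centres, labels, feature_names):
    """One human-readable line per segment, built from its centre."""
    lines = []
    for i, centre in enumerate(centres):
        traits = ", ".join(
            f"{name}={value:.1f}" for name, value in zip(feature_names, centre)
        )
        lines.append(f"Segment {i} ({labels.count(i)} customers): {traits}")
    return lines


# Made-up customers: (orders per month, average basket value).
customers = [(1, 20), (2, 25), (1, 22), (10, 200), (12, 210), (11, 190)]
centres, labels = kmeans(customers, k=2)
for line in describe(centres, labels, ["orders_per_month", "avg_basket"]):
    print(line)
```

The printed lines are what you would put in front of someone with domain knowledge: if a segment's centre looks implausible to them, that is a prompt to check the data before acting on the segment.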

I started writing this post some three days ago and am only getting to finish it now. Unfortunately, in the meantime I’ve forgotten the exact motivation of why I started writing this. If I recall that, I’ll maybe do another post.