Communicating binary forecasts

One silver lining in the madness of the US Presidential election counting is that there are some interesting analyses floating around regarding polling and surveying and probabilities and visualisation. Take this post from Andrew Gelman’s blog, for example:

Suppose our forecast in a certain state is that candidate X will win 0.52 of the two-party vote, with a forecast standard deviation of 0.02. Suppose also that the forecast has a normal distribution.[…]

Then your 68% predictive interval for the candidate’s vote share is [0.50, 0.54], and your 95% interval is [0.48, 0.56].

Now suppose the candidate gets exactly half of the vote. Or you could say 0.499, the point being that he lost the election in that state.

This outcome falls on the boundary of the 68% interval, it’s one standard deviation away from the forecast. In no sense would this be called a prediction error or a forecast failure.

But now let’s say it another way. The forecast gave the candidate an 84% chance of winning! And then he lost. That’s pretty damn humiliating. The forecast failed.

It took me a while to appreciate this. In a binary outcome, if your model predicts 52% with a standard deviation of 2%, you are in effect predicting a “win” (50% or higher) with a probability of 84%! Somehow I had never thought about it that way.

In any case, this tells you how tricky forecasting a binary outcome is. You might think (based on your sample size) that a 2% standard deviation is reasonable. Except that when the mean of your forecast is close to the barrier (50% in this case), the “reasonable standard deviation” lends a much stronger meaning to your forecast.
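A quick sanity check of that 84% figure, using the same pnorm function that appears in the quoted snippets below:

1 - pnorm(0.50, 0.52, 0.02)   # ~0.84: the probability that the vote share exceeds 0.50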

Gelman goes on:

That’s right. A forecast of 0.52 +/- 0.02 gives you an 84% chance of winning.

We want to increase the sd in the above expression so as to send the win probability down to 60%. How much do we need to increase it? Maybe send it from 0.02 to 0.03?

> pnorm(0.52, 0.50, 0.03)
[1] 0.75

Uh, no, that wasn’t enough! 0.04?

> pnorm(0.52, 0.50, 0.04)
[1] 0.69

0.05 won’t do it either. We actually have to go all the way up to . . . 0.08:

> pnorm(0.52, 0.50, 0.08)
[1] 0.60

That’s right. If your best guess is that candidate X will receive 0.52 of the vote, and you want your forecast to give him a 60% chance of winning the election, you’ll have to ramp up the sd to 0.08, so that your 95% forecast interval is a ridiculously wide 0.52 +/- 2*0.08, or [0.36, 0.68].

Who said forecasting an election is easy?


What is the Case Fatality Rate of Covid-19 in India?

The economist in me will give a very simple answer to that question – it depends. It depends on how long you think people will take from onset of the disease to die.

The modeller in me extended the argument that the economist in me made, and built a rather complicated model. This involved smoothing, assumptions on probability distributions, long mathematical derivations and (for good measure) regressions. And out of all that came this graph, built on the assumption that the average person who dies of covid-19 does so 20 days after the infection is detected.

[Graph: estimated case fatality rate by Indian state, assuming death occurs on average 20 days after detection]

Yes, there is a wide variation across the country. Given that the disease is the same and the treatment for most patients is pretty much the same (lots of rest, lots of water, etc.), it is weird that the case fatality rate varies so much across Indian states. There is only one explanation – assuming that deaths can’t be faked or miscounted (covid deaths attributed to other causes or vice versa), the problem is in the “denominator” – the number of confirmed cases.

What the variation here tells us is that in states towards the top of this graph, we are likely not detecting most of the positive cases (serious cases will get themselves tested anyway, and get hospitalised, and perhaps die. It’s the less serious cases that can “slip”). Taking a state low down in this graph as a “good tester” (say Andhra Pradesh), we can try to estimate the extent of under-detection of cases in each state.

Based on state-wise case tallies as of now (there might be some error since some states might have reported today’s numbers and some might not have), here are my estimates of the actual number of cases per state, based on our calculations of the case fatality rate.

Yeah, Maharashtra alone should have crossed a million cases based on the number of people who have died there!

Now let’s get to the maths. It’s messy. First we look at the number of confirmed cases per day and number of deaths per day per state (data from here). Then we smooth the data and take 7-day trailing moving averages. This is to get rid of any reporting pile-ups.

Now comes the probability assumption – we assume that a proportion p of all the confirmed cases will die. We assume an average number of days (N) to death for the people who are going to die (let’s call them Romeos?). They won’t all pop off exactly N days after we detect their infection. Instead, of everyone who is infected, supposed to die and not yet dead, a proportion \lambda will die each day.

My maths has become rather rusty over the years, but a derivation I made shows that \lambda = \frac{1}{N}. So if people die an average of 20 days after detection, \frac{1}{20} of the Romeos will die one day after detection, \frac{19}{20}\frac{1}{20} the day after that. And so on.
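A two-line check of this (just a sketch, with N = 20 assumed): the geometric weights sum to one and have mean N, which is what we want.

N <- 20
lambda <- 1 / N
k <- 1:1000                          # days after detection
w <- lambda * (1 - lambda)^(k - 1)   # fraction of Romeos dying on day k
sum(w)                               # ~1: everyone who is going to die, dies
sum(k * w)                           # ~20: average time to death is N days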

So people who die today could be people who were detected with the infection yesterday, or the day before, or the day before the day before (isn’t it weird that English doesn’t have a word for this?), or … Now, based on how many cases were detected on each day, and our assumption of p (let’s assume a value first; we can derive it back later), we can work out how many of the people who were found sick k days back are going to die today. Do this for all k, and you can model how many people will die today.

The equation will look something like this. Assume d_t is the number of people who die on day t and n_t is the number of cases confirmed on day t. We get

d_t = p  (\lambda n_{t-1} + (1-\lambda) \lambda n_{t-2} + (1-\lambda)^2 \lambda n_{t-3} + ... )

Now, all these ns are known. d_t is known. \lambda comes from our assumption of how long people will, on average, take to die once their infection has been detected. So in the above equation, everything except p is known.

And we have this data for multiple days. We know the left hand side. We know the value in brackets on the right hand side. All we need to do is to find p, which I did using a simple regression.

And I did this for each state – take the number of confirmed cases on each day, the number of deaths on each day and your assumption on average number of days after detection that a person dies. And you can calculate p, which is the case fatality rate. The true proportion of cases that are resulting in deaths.

This produced the first graph that I’ve presented above, for the assumption that a person, should he die, dies on an average 20 days after the infection is detected.

So what is India’s case fatality rate? While the first graph says it’s 5.8%, the variation by state suggests that the real issue is mild cases going undetected, so the true case fatality rate is likely far lower. From doing my daily updates on Twitter, I’ve come to trust Andhra Pradesh as a state that is testing well, so if we assume they’ve found all their active cases, we can use that as a base and arrive at the second graph, which shows the true number of cases in each state.

PS: It’s common to just divide the number of deaths so far by number of cases so far, but that is an inaccurate measure, since it doesn’t take into account the vintage of cases. Dividing deaths by number of cases as of a fixed point of time in the past is also inaccurate since it doesn’t take into account randomness (on when a Romeo might die).

Anyway, here is my code, for what it’s worth.

library(dplyr)
library(tidyr)

deathRate <- function(covid, avgDays) {
  # Reshape the daily data into one row per state per date,
  # with Confirmed and Deceased as columns
  covid %>%
    mutate(Date = as.Date(Date, '%d-%b-%y')) %>%
    gather(State, Number, -Date, -Status) %>%
    spread(Status, Number) %>%
    arrange(State, Date) ->
    cov1

  # Smooth daily counts with a 7-day trailing moving average
  # to get rid of reporting pile-ups
  cov1 %>%
    arrange(State, Date) %>%
    group_by(State) %>%
    mutate(
      TotalConfirmed = cumsum(Confirmed),
      TotalDeceased = cumsum(Deceased),
      ConfirmedMA = (TotalConfirmed - lag(TotalConfirmed, 7)) / 7,
      DeceasedMA = (TotalDeceased - lag(TotalDeceased, 7)) / 7
    ) %>%
    ungroup() %>%
    filter(!is.na(ConfirmedMA)) %>%
    select(State, Date, Deceased = DeceasedMA, Confirmed = ConfirmedMA) ->
    cov2

  # For each death date, weight past confirmed cases by the geometric
  # probability of dying exactly 'Delay' days after detection, then regress
  # observed deaths on this weighted sum (no intercept) to estimate p
  cov2 %>%
    select(DeathDate = Date, State, Deceased) %>%
    inner_join(
      cov2 %>%
        select(ConfirmDate = Date, State, Confirmed) %>%
        crossing(Delay = 1:100) %>%
        mutate(DeathDate = ConfirmDate + Delay),
      by = c("DeathDate", "State")
    ) %>%
    filter(DeathDate > ConfirmDate) %>%
    arrange(State, desc(DeathDate), desc(ConfirmDate)) %>%
    mutate(
      Lambda = 1 / avgDays,
      Adjusted = Confirmed * Lambda * (1 - Lambda)^(Delay - 1)
    ) %>%
    filter(Deceased > 0) %>%
    group_by(State, DeathDate, Deceased) %>%
    summarise(Adjusted = sum(Adjusted)) %>%
    ungroup() %>%
    lm(Deceased ~ Adjusted - 1, data = .) %>%
    summary() %>%
    broom::tidy() %>%
    select(estimate) %>%
    first() %>%
    return()
}
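And a hypothetical usage sketch (the file name is made up; the column layout – Date, Status, one column per state – is assumed to match the covid19india.org daily file that the function expects):

# 'state_wise_daily.csv' is a hypothetical local copy of the state-wise daily data
covid <- read.csv("state_wise_daily.csv", stringsAsFactors = FALSE)
deathRate(covid, avgDays = 20)   # estimated case fatality rate, assuming death ~20 days after detection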

More on covid testing

There has been a massive jump in the number of covid-19 positive cases in Karnataka over the last couple of days. Today, there were 44 new cases discovered, and yesterday there were 36. This is a big jump from the average of about 15 cases per day in the preceding 4-5 days.

The good news is that not all of this is new infection. A lot of cases that have come out today are clusters of people who have collectively tested positive. However, there is one bit from yesterday’s cases (again a bunch of clusters) that stands out.

[Screenshot of yesterday’s case details. Source: covid19india.org]

I guess by now everyone knows what “travelled from Delhi” is a euphemism for. The reason they are interesting to me is that they are based on a “repeat test”. In other words, all these people had tested negative the first time they were tested, and then they were tested again yesterday and found positive.

Why did they need a repeat test? That’s because the sensitivity of the Covid-19 test is rather low. Out of every 100 infected people who take the test, only about 70 are found positive (on average) by the test. That also depends upon when the sample is taken.  From the abstract of this paper:

Over the four days of infection prior to the typical time of symptom onset (day 5) the probability of a false negative test in an infected individual falls from 100% on day one (95% CI 69-100%) to 61% on day four (95% CI 18-98%), though there is considerable uncertainty in these numbers. On the day of symptom onset, the median false negative rate was 39% (95% CI 16-77%). This decreased to 26% (95% CI 18-34%) on day 8 (3 days after symptom onset), then began to rise again, from 27% (95% CI 20-34%) on day 9 to 61% (95% CI 54-67%) on day 21.

About one in three (depending upon when you draw the sample) infected people who have the disease are found by the test to be uninfected. Maybe I should state it again. If you test a covid-19 positive person for covid-19, there is almost a one-third chance that she will be found negative.

The good news (on the face of it) is that the test has “high specificity” of about 97-98% (this is from conversations I’ve had with people in the know; I’m unable to find links to corroborate this), or a false positive rate of 2-3%. That seems rather accurate, except that when the “prior probability” of having the disease is low, even this specificity is not good enough.

Let’s assume that a million Indians are covid-19 positive (the official numbers as of today are a little more than one-hundredth of that number). With one and a third billion people, that represents 0.075% of the population.

Let’s say we were to start “random testing” (as a number of commentators are advocating), and were to pull a random person off the street to test for Covid-19. The “prior” (before testing) likelihood she has Covid-19 is 0.075% (assume we don’t know anything more about her to change this assumption).

If we were to take 20000 such people, 15 of them will have the disease. The other 19985 don’t. Let’s test all 20000 of them.

Of the 15 who have the disease, the test returns “positive” for 10.5 (70% sensitivity; round up to 11). Of the 19985 who don’t have the disease, the test returns “positive” for 400 of them (let’s assume a specificity of 98%, or a false positive rate of 2%, placing more faith in the test)! In other words, if there were a million Covid-19 positive people in India, and a random Indian were to take the test and test positive, the likelihood she actually has the disease is 11/411 = 2.6%.

If there were 10 million covid-19 positive people in India (no harm in supposing), then the “base rate” would be .75%. So out of our sample of 20000, 150 would have the disease. Again testing all 20000, 105 of the 150 who have the disease would test positive. 397 of the 19850 who don’t have the disease will test positive. In other words, if there were ten million Covid-19 positive people in India, and a random Indian were to take the test and test positive, the likelihood she actually has the disease is 105/(397+105) = 21%.

If there were ten million Covid-19 positive people in India, only one-fifth of the people who tested positive in a random test would actually have the disease.
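The arithmetic above is just Bayes’ Theorem; here is a small sketch that reproduces it (the 70% sensitivity and 98% specificity are the assumptions from the text):

# Posterior probability of actually having the disease, given a positive test
posterior <- function(prevalence, sensitivity = 0.70, specificity = 0.98) {
  true_positives  <- prevalence * sensitivity
  false_positives <- (1 - prevalence) * (1 - specificity)
  true_positives / (true_positives + false_positives)
}
posterior(0.00075)   # ~1 million infected Indians: about 2.6%
posterior(0.0075)    # ~10 million infected Indians: about 21%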

Take a sip of water (ok I’m reading The Ken’s Beyond The First Order too much nowadays, it seems).

This is all standard maths stuff, and any self-respecting book or course on probability and Bayes’s Theorem will have at least a reference to AIDS or cancer testing. The story goes that this was a big deal in the 1990s when some people suggested that the AIDS test be used widely. Then, once this problem of false positives and posterior probabilities was pointed out, the strategy of only testing “high risk cases” got accepted.

And with a “low incidence” disease like covid-19, effective testing means you test people with a high prior probability. In India, that has meant testing people who travelled abroad, people who have come in contact with other known infected, healthcare workers, people who attended the Tablighi Jamaat conference in Delhi, and so on.

The advantage with testing people who already have a reasonable chance of having the disease is that once the test returns positive, you can be pretty sure they actually have the disease. It is more effective and efficient. Testing people with a “high prior probability of disease” is not discriminatory, or a “sampling bias” as some commentators alleged. It is prudent statistical practice.

Again, as I found to my own detriment with my tweetstorm on this topic the other day, people are bound to see politics and ascribe political motives to everything nowadays. In that sense, a lot of the commentary is not surprising. It’s also not surprising that when “one wing” heavily retweeted my article, “the other wing” made efforts to find holes in my argument (which, again, is textbook math).

One possibly apolitical criticism of my tweetstorm was that “the purpose of random testing is not to find out who is positive. It is to find out what proportion of the population has the disease”. The costs of this (apart from the monetary cost of actually testing) are threefold. Firstly, a large number of uninfected people (false positives) will get hospitalised in covid-specific hospitals, clogging hospital capacity and increasing the chances that they get infected while in hospital.

Secondly, getting a truly random sample in this case is tricky, and possibly unethical. When you have limited testing capacity, you would be inclined (possibly morally, even) to use it on people who already have a high prior probability.

Finally, when the incidence is small, we need a really large sample to estimate the true incidence with any precision.

Let’s say 1 in 1000 Indians have the disease (or about 1.35 million people). Using the Chi Square test of proportions, the precision of our estimate of the incidence of the disease varies significantly with how many people are tested.

If we test a 1000 people and find 1 positive, the true incidence of the disease (95% confidence interval) could be anywhere from 0.01% to 0.65%.

If we test 10000 people and find 10 positive, the true incidence of the disease could be anywhere between 0.05% and 0.2%.

Only if we test 100000 people (a truly massive random sample) and find 100 positive does the estimated incidence narrow to between 0.08% and 0.12%, an acceptable range.
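These intervals can be reproduced, approximately, with R’s built-in test of proportions (I’m not claiming this is exactly how the numbers above were computed – the precise bounds depend on the interval method used):

prop.test(1, 1000)$conf.int       # compare with the ~0.01% to 0.65% range above
prop.test(10, 10000)$conf.int     # compare with the ~0.05% to 0.2% range
prop.test(100, 100000)$conf.int   # compare with the ~0.08% to 0.12% range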

I admit that we may not be testing enough. A simple rule of thumb is that anyone with more than a 5% prior probability of having the disease needs to be tested. How we determine this prior probability is again dependent on some rules of thumb.

I’ll close by saying that we should NOT be doing random testing. That would be unethical on multiple counts.

Surveying Income

For a long time now, I’ve been sceptical of the practice of finding out the average income in a country or state or city or locality by doing a random survey. The argument I’ve made is “whether you keep Mukesh Ambani in the sample or not makes a huge difference in your estimate”. So far, though, I hadn’t been able to make a proper mathematical argument.

In the course of writing a piece for Bloomberg Quint (my first for that publication), I figured out a precise mathematical argument. Basically, incomes are distributed according to a power law distribution, and the exponent of the power law means that variance is not defined. And hence the Central Limit Theorem isn’t applicable.

OK let me explain that in English. The reason sample surveys work is due to a result known as the Central Limit Theorem. This states that for a distribution with finite mean and variance, the average of a random sample of data points is not very far from the average of the population, and the difference follows a normal distribution with zero mean and variance that is inversely proportional to the number of points surveyed.

So if you want to find out the average height of the population of adults in an area, you can simply take a random sample, find out their heights and you can estimate the distribution of the average height of people in that area. It is similar with voting intention – as long as the sample of people you survey is random (and without bias), the average of their voting intention can tell you with high confidence the voting intention of the population.

This, however, doesn’t work for income. Based on data from the Indian Income Tax department, I could confirm (what theory states) that income in India follows a power law distribution. As I wrote in my piece:

The basic feature of a power law distribution is that it is self-similar – where a part of the distribution looks like the entire distribution.

Based on the income tax returns data, the number of taxpayers earning more than Rs 50 lakh is 40 times the number of taxpayers earning over Rs 5 crore.
The ratio of the number of people earning more than Rs 1 crore to the number of people earning over Rs 10 crore is 38.
About 36 times as many people earn more than Rs 5 crore as do people earning more than Rs 50 crore.

In other words, if you increase the income limit by a factor of 10, the number of people who earn over that limit falls by a factor between 35 and 40. This translates to a power law exponent between 1.55 and 1.6 (log 35 to base 10 and log 40 to base 10 respectively).

Now power laws have a quirk – their mean and variance are not always defined. If the exponent of the power law is less than 1, the mean is not defined. If the exponent is less than 2, then the distribution doesn’t have a defined variance. So in this case, with an exponent around 1.6, the distribution of income in India has a well-defined mean but no well-defined variance.

To recall, the central limit theorem states that the population mean follows a normal distribution centred at the sample mean, with a variance of \frac{\sigma^2}{n}, where \sigma is the standard deviation of the underlying distribution. And when the underlying distribution is itself a power law with an exponent less than 2 (as is the case in India), \sigma itself is not defined.

Which means the distribution of population mean around sample mean has infinite variance. Which means the sample mean tells you absolutely nothing!
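A quick simulation makes the point (my own illustrative sketch, not part of the Bloomberg Quint piece; the Pareto sampler uses inverse transform sampling with the tail exponent estimated above):

set.seed(42)
alpha <- 1.6                                     # tail exponent from the tax data
rpareto <- function(n, alpha, xmin = 1) xmin / runif(n)^(1 / alpha)

# Five independent "surveys" of 100,000 incomes each: the sample means
# bounce around wildly because the underlying variance is infinite
replicate(5, mean(rpareto(1e5, alpha)))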

And hence, surveying is not a good way to find the average income of a population.

Bayesian Reasoning and Indian Philosophy

I’m currently reading a book called How the World Thinks: A global history of philosophy by Julian Baggini. I must admit I bought this by mistake – I was at a bookshop where I saw this book and went to the Amazon website to check reviews. And by mistake I ended up hitting buy. And before I got around to returning it, I started reading and liking it, so I decided to keep it.

In any case, this book is a nice comparative history of world philosophies, with considerable focus on Indian, Chinese, Japanese and Islamic philosophies. The author himself is trained in European/Western philosophy, but he keeps an open mind and so far it’s been an engaging read.

Rather than approaching the topic in chronological order, like some historians might have been tempted to do, this book approaches it by concept, comparing how different philosophies treat the same concept. And the description of Indian philosophy in the “Logic” chapter caught my eye, in the sense that it reminded me of Bayesian logic, and a piece I’d written a few years back.

Talking about Hindu philosophy and logic, Baggini writes:

For instance, the Veda affirms that when the appropriate sacrifice for the sake of a son is performed, a son will be produced. But it is often observed that a son is not produced, even though the sacrifice has been performed. This would seem to be pretty conclusive proof that the sacrifices don’t work and so the Veda is flawed. Not, however, if you start from the assumption that the Veda cannot be flawed.

In other words, Hindu philosophy starts with the Bayesian prior that the Veda cannot be flawed – a prior probability of zero that it is flawed. Consequently, irrespective of how strong the empirical evidence that the Vedas are flawed, the belief in the Vedas can never change! On the other hand, if the prior probability that the Vedas were flawed were positive, even if infinitesimal, then the accumulation of evidence such as the above (sacrifices that were supposed to produce sons but failed to do so) would over time push the probability of the Vedas being flawed upwards, soon tending to 1.
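A toy illustration of why the prior matters so much (my own sketch; the likelihood ratio of 10 is an arbitrary assumption, standing in for how much more likely a failed sacrifice is if the Veda is indeed flawed):

# Posterior probability that "the Veda is flawed" after n failed sacrifices
update_prior <- function(prior, n, likelihood_ratio = 10) {
  odds <- (prior / (1 - prior)) * likelihood_ratio^n   # Bayes' rule in odds form
  odds / (1 + odds)
}
update_prior(0, 100)      # a prior of exactly zero never moves: stays 0
update_prior(1e-9, 20)    # even a tiny positive prior races towards 1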

In 2015, I had written in Mint about how Bayesian logic can be used to explain online flame wars. There again, I had written about how when people start with extreme opinions (probabilities equal to 0 or 1), even the strongest contrary evidence is futile to get them to change their opinions. And hence in online flame wars you have people simply talking past each other because neither is willing to update their opinions in the face of evidence.

Coming back to Hindu philosophy, this prior belief that the Vedas cannot be flawed reminds me of the numerous futile arguments with some of my relatives who are of a rather religious persuasion. In each case I presented to them what seemed like strong proof that some of their assumptions of religion are flawed. In each case, irrespective of the strength of my evidence, they refused to heed my argument. Now, looking at the prior of a religious Hindu – that the likelihood of the Vedas being flawed is 0 (not infinitesimal, but 0), it is clear why my arguments fell on deaf ears.

In any case, Baggini goes on to say:

By this logic, if ‘a son is sure to be produced as a result of performing the sacrifice’ but a son is not produced, it can only follow that the sacrifice was not performed correctly, however much it seems that it was performed properly. By such argument, the Nyāya Sūtra can safely conclude, ‘Therefore there is no untruth in the Veda.’

Hypothesis Testing in Monte Carlo

I find it incredible, and not in a good way, that I took fourteen years to make the connection between two concepts I learnt barely a year apart.

In August-September 2003, I was auditing an advanced (graduate) course on Advanced Algorithms, where we learnt about randomised algorithms (I soon stopped auditing since the maths got heavy). And one important class of randomised algorithms is what is known as “Monte Carlo Algorithms”. Not to be confused with Monte Carlo Simulations, these are randomised algorithms that give a one-sided result. So, using the most prominent example of such an algorithm, you can ask “is this number prime?” and the answer to that can be either “maybe” or “no”.

The randomised algorithm can never conclusively answer “yes” to the primality question. If a run finds evidence that the number is composite (a “witness”, such as a factor), it answers “no”, and that is conclusive. Otherwise it returns “maybe”. So the way you “conclude” that a number is prime is by running the test a large number of times. Each “maybe” reduces the probability that the true answer is “no” (since the runs are independent), and when that probability is low enough, you “think” it’s a “yes”. You might like this old post of mine regarding Monte Carlo algorithms in the context of romantic relationships.
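To make the “no or maybe” structure concrete, here is a minimal sketch of one such randomised primality check (a Fermat test – a simpler cousin of the tests covered in that course, and my own illustration rather than anything from it):

# Modular exponentiation: (base ^ exp) mod m, by repeated squaring
powmod <- function(base, exp, m) {
  result <- 1
  base <- base %% m
  while (exp > 0) {
    if (exp %% 2 == 1) result <- (result * base) %% m
    base <- (base * base) %% m
    exp <- exp %/% 2
  }
  result
}

# Fermat primality test: "no" is conclusive, "maybe" never is
fermat_test <- function(n, trials = 20) {
  for (i in seq_len(trials)) {
    a <- sample(2:(n - 2), 1)
    if (powmod(a, n - 1, n) != 1) return("no")   # definitely composite
  }
  "maybe"                                        # probably prime; never a sure "yes"
}

fermat_test(221)   # 13 * 17, so almost certainly "no"
fermat_test(101)   # prime, so "maybe"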

Less than a year later, in July 2004, as part of a basic course in statistics, I learnt about hypothesis testing. Now (I’m kicking myself for failing to see the similarity then), the main principle of hypothesis testing is that you can never “accept a hypothesis”. You either reject a hypothesis or “fail to reject” it.  And if you fail to reject a hypothesis with a certain high probability (basically with more data, which implies more independent evaluations that don’t say “reject”), you will start thinking about “accept”.

Basically hypothesis testing is a one-sided  test, where you are trying to reject a hypothesis. And not being able to reject a hypothesis doesn’t mean we necessarily accept it – there is still the chance of going wrong if we were to accept it (this is where we get into messy territory such as p-values). And this is exactly like Monte Carlo algorithms – one-sided algorithms where we can only conclusively take a decision one way.

So I was thinking of these concepts when I came across this headline in ESPNCricinfo yesterday that said “Rahul Johri not found guilty” (not linking since Cricinfo has since changed the headline). The choice, or rather ordering, of words was interesting. “Not found guilty”, it said, rather than the usual “found not guilty”.

This is again a concept of one-sided testing. An investigation can either find someone guilty or it fails to do so, and the heading in this case suggested that the latter had happened. And as a deliberate choice, it became apparent why the headline was constructed this way – later it emerged that the decision to clear Rahul Johri of sexual harassment charges was a contentious one.

In most cases, when someone is “found not guilty” following an investigation, it usually suggests that the evidence on hand was enough to say that the chance of the person being guilty was rather low. The phrase “not found guilty”, on the other hand, says that one test failed to reject the hypothesis, but it didn’t have sufficient confidence to clear the person of guilt.

So due credit to the Cricinfo copywriters, and due debit to the product managers for later changing the headline rather than putting a fresh follow-up piece.

PS: The discussion following my tweet on the topic threw up one very interesting insight – that Scotland has had a “not proven” verdict for exactly such cases in the past (you can trust DD to come up with such gems).

Dimensional analysis in stochastic finance

Yesterday I was reading through Ole Peters’s lecture notes on ergodicity, a topic that I got interested in thanks to my extensive use of Utility Theory in my work nowadays. And I had a revelation – that in standard stochastic finance, mean returns and standard deviation of returns don’t have the same dimensions. Instead, it’s mean returns and the variance of returns that have the same dimensions.

While this might sound counterintuitive, it is not hard to see if you think about it analytically. We will start with what is possibly the most basic equation in stochastic finance, which is the lognormal random walk model of stock prices.

dS = \mu S dt + \sigma S dW

This can be rewritten as

\frac{dS}{S} = \mu dt + \sigma dW

Now, let us look at dimensions. The LHS divides change in stock price by stock price, and is hence dimensionless. So the RHS needs to be dimensionless as well if the equation is to make sense.

It is easy to see that the first term on the RHS is dimensionless – \mu, the average return or drift, is defined as “return per unit time”. So a stock that returns, on average, 10% in a year returns 20% in two years. Thus \mu has dimensions t^{-1}, and multiplying it by dt, which has the dimension of time, renders the term dimensionless.

That leaves us with the last term. dW is the Wiener Process, and is defined such that dW^2 = dt. This implies that dW has the dimensions \sqrt{t}. This means that the equation is meaningful if and only if \sigma has dimensions t^{-\frac{1}{2}}, which is the same as saying that \sigma^2 has dimensions \frac{1}{t}, which is the same as the dimensions of the mean returns.

It is not hard to convince yourself that it makes intuitive sense as well. The basic assumption of a random walk is that the variance grows linearly with time (another way of seeing this is that when you add two uncorrelated random variables, their variances add up to give the variance of the sum). From this again, variance has the units of inverse time – the same as the mean.
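A small simulation bears this out (my sketch; the 10% drift and 20% volatility are arbitrary assumed values): doubling the horizon roughly doubles the variance of returns, so it is \sigma^2, not \sigma, that scales like \frac{1}{t}.

set.seed(1)
mu <- 0.10; sigma <- 0.20   # assumed annualised drift and volatility

# Terminal log-return of a lognormal random walk over 'horizon' years
simulate_return <- function(horizon, n = 1e5) {
  rnorm(n, mean = (mu - sigma^2 / 2) * horizon, sd = sigma * sqrt(horizon))
}

var(simulate_return(1))   # ~0.04 = sigma^2
var(simulate_return(2))   # ~0.08, i.e. roughly double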

Finally, speaking of dimensional analysis and Ole Peters, check out his proof of the Pythagoras Theorem using dimensional analysis.

Isn’t it beautiful?

PS: Speaking of dimensional analysis, check out my recent post on stocks and flows and financial ratios.


Machine learning and degrees of freedom

For starters, machine learning is not magic. It might appear like magic when you see Google Photos automatically tagging all your family members correctly, down to the day of their birth. It might appear so when Siri or Alexa give a perfect response to your request. And the way AlphaZero plays chess is almost human!

But no, machine learning is not magic. I’d made a detailed argument about that in the second edition of my newsletter (subscribe if you haven’t already!).

One way to think of it is that the output of a machine learning model (which could be anything from “does this picture contain a cat?” to “is the speaker speaking in English?”) is the result of a mathematical formula, whose parameters are unknown at the beginning of the exercise.

As the system gets “trained” (of late I’ve avoided using the word “training” in the context of machine learning, preferring to use “calibration” instead. But anyway…), the hitherto unknown parameters of the formula get adjusted in a manner that the formula output matches the given data. Once the system has “seen” enough data, we have a model, which can then be applied on unknown data (I’m completely simplifying it here).

The genius in machine learning lies in setting up mathematical formulae in such a way that given input-output pairs of data can be used to adjust the parameters of the formulae. The genius in deep learning, which has been the rage this decade, for example, comes from a 30-year-old mathematical breakthrough called “back propagation”. The reason it took until a few years back for it to become a “thing” has to do with data availability and compute power (check this terrific piece in the MIT Tech Review about deep learning).

Within machine learning, the degree of complexity of a model can vary significantly. In an ordinary univariate least squares regression, for example, there are only two parameters the system can play with (slope and intercept of the regression line). Even a simple “shallow” neural network, on the other hand, has thousands of parameters.

Because a regression has so few parameters, the kind of patterns that the system can detect is rather limited (whatever you do, the system can only draw a line. Nothing more!). Thus, regression is applied only when you know that the relationship that exists is simple (and linear), or when you are trying to force-fit a linear model.

The upside of simple models such as regression is that because there are so few parameters to be adjusted, you need relatively few data points in order to adjust them to the required degree of accuracy.

As models get more and more complicated, the number of parameters increases, thus increasing the complexity of patterns that can be detected by the system. Close to one extreme, you have systems that see lots of current pictures of you and then identify you in your baby pictures.

Such complicated patterns can be identified because the system parameters have lots of degrees of freedom. The downside, of course, is that because the parameters start off having so much freedom, it takes that much more data to “tie them down”. The reason Google Photos can tag you in your baby pictures is partly down to the quantum of image data that Google has, which does an effective job of tying down the parameters. Google Translate similarly uses large repositories of multi-lingual text in order to “learn languages”.

Like most other things in life, machine learning also involves a tradeoff. It is possible for systems to identify complex patterns, but for that you need to start off with lots of “degrees of freedom”, and then use lots of data to tie down the variables. If your data is small, then you can only afford a small number of parameters, and that limits the complexity of patterns that can be detected.
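Here is a toy illustration of that tradeoff (my own sketch, with made-up data): a two-parameter model and a sixteen-parameter model fit to the same twenty points. The bigger model hugs the training data more closely, but typically does worse on fresh data, because its extra degrees of freedom end up fitting noise.

set.seed(7)
n <- 20
x <- runif(n); y <- 2 * x + rnorm(n, sd = 0.2)                      # training data
x_new <- runif(1000); y_new <- 2 * x_new + rnorm(1000, sd = 0.2)    # fresh data

rmse <- function(fit, xs, ys) sqrt(mean((predict(fit, data.frame(x = xs)) - ys)^2))

simple  <- lm(y ~ x)             # 2 parameters
complex <- lm(y ~ poly(x, 15))   # 16 parameters

c(in_sample = rmse(simple, x, y), out_of_sample = rmse(simple, x_new, y_new))
c(in_sample = rmse(complex, x, y), out_of_sample = rmse(complex, x_new, y_new))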

One way around this, of course, is to use your own human intelligence as a pre-processing step in order to set up parameters in a way that they can be effectively tuned by data. Gopi had a nice post recently on “neat learning versus deep learning“, which is relevant in this context.

Finally, there is the issue of spurious correlations. Because machine learning systems are basically mathematical formulae designed to learn patterns from data, spurious correlations in the input dataset can lead to the system learning random things, which can hamper its predictive power.

Data sets, especially ones that have lots of dimensions, can display correlations that appear at random, but if the input dataset shows enough of these correlations, the system will “learn” them as a pattern, and try to use them in predictions. And the more complicated your model gets, the harder it is to know what it is doing, and thus the harder it is to identify these spurious correlations!

And the thing with having too many “free parameters” (lots of degrees of freedom but without enough data to tie down the parameters) is that these free parameters are especially susceptible to learning the spurious correlations – for they have no other job.

Thinking about it, after all, machine learning systems are not human!

JEE coaching and high school learning

One reason I’m not as good at machine learning as I can possibly be is because I suck at linear algebra. I totally completely suck at it. Seven years of usage of R has meant that at least I no longer get spooked out by the very sight of vectors or matrices, and I understand the concept of matrix multiplication (an operator rotating a vector), but I just don’t get linear algebra.

For example, when I see terms such as “singular value decomposition” I almost faint. Multiple repeated attempts at learning the concept have utterly failed. Don’t even get me started on the more complicated stuff – and machine learning is full of them.

My inability to understand linear algebra runs deep, and it’s mainly due to a complete inability to imagine vectors and matrices and matrix operations. As far back as I can remember, I have hated matrices and have tried to run away from them.

For a long time, I had placed the blame for this on IIT Madras, whose mathematics department in its infinite wisdom had decided to get its brilliant Graph Theory expert to teach us matrices. Thinking back, though, I remember going in to MA102 (Vectors, Matrices and Differential Equations) already spooked. The rot had set in even earlier – in school.

The problem with class 11 in my school (a fairly high-profile school which was full of studmax characters) was that most people harboured ambitions of going to IIT, and had consequently enrolled themselves in formal coaching “factories”. As a result, these worthies always came to maths, physics and chemistry classes “ahead” of people like me who didn’t go for such classes (I’d decided to chill for a year after a rather hectic class 10 when I’d been under immense pressure to get my school a “centum”).

Because a large majority of the class already knew what was to be taught, teachers had an incentive to slack. Also, the fact that most students were studmax meant that people preferred to mug on their own rather than display their ignorance in class. And so jai happened.

I remember the class when vectors and matrices were introduced (it was in class 11). While I don’t remember too many details, I do remember that a vocal majority already knew about “dot product” and “cross product”. It was similar a few days later when the vocal majority knew matrix multiplication.

And so these concepts were glossed over, and lacking a grounding in fundamentals, I somehow never “got” the concept.

In my year (2000), CBSE decided to change the format of its maths examination – everyone had to attempt “Part A” (worth 70 marks) and then had a choice between “Part B” (vectors, matrices, etc.) and “Part C” (introductory statistics). Most science students were expected to opt for Part B (Part C had been introduced for the benefit of commerce students studying maths, since they had little to gain from reading about vectors). For me and one other guy from my class, though, it was a rather obvious choice to do Part C.

I remember the invigilator (who was from another school) being positively surprised during my board exam when I mentioned that I was going to attempt Part C instead of Part B. He muttered something to the effect of “isn’t that for commerce students?” but to his credit permitted us to do the paper in whatever way we wanted (I fail to remember why I had to mention to him that I was doing Part C – maybe I needed log tables to do that).

Seventeen odd years down the line, I continue to suck at linear algebra and be stud at statistics. And it is all down to the way the two subjects were introduced to me in school (JEE statistics wasn’t up to the same standard as Part C so the school teachers did a great job of teaching that).

Maths, machine learning, brute force and elegance

Back when I was at the International Maths Olympiad Training Camp in Mumbai in 1999, the biggest insult one could hurl at a peer was to describe the latter’s solution to a problem as being a “brute force solution”. Brute force solutions, which were often ungainly, laboured and unintuitive were supposed to be the last resort, to be used only if one were thoroughly unable to implement an “elegant solution” to the problem.

Mathematicians love and value elegance. While they might be comfortable with esoteric formulae and the Greek alphabet, they are always on the lookout for solutions that are, at least to the trained eye, intuitive to perceive and understand. Among other things, there is the belief that it is much easier to get an intuitive understanding of an elegant solution.

When all the parts of the solution seem to fit so well into each other, with no loose ends, it is far easier to accept the solution as being correct (even if you don’t understand it fully). Brute force solutions, on the other hand, inevitably leave loose ends and appreciating them can be a fairly massive task, even to trained mathematicians.

In the conventional view, though, non-mathematicians don’t have much fondness for elegance. A solution is a solution, and a problem solved is a problem solved.

With the coming of big data and increased computational power, however, the tables are getting turned. Now it is the more mathematical people, who are more likely to appreciate “machine learning” algorithms, who recommend “leaving it to the system” – unleashing the brute force of computational power on the problem so that the “best model” can be found, and later implemented.

And in this case, it is the “half-blood mathematicians” like me, who are aware of complex algorithms but are unsure of letting the system take over stuff end-to-end, who bat for elegance – to look at data, massage it, analyse it and then find that one simple method or transformation that can throw immense light on the problem, effectively solving it!

The world moves in strange ways.