Machine learning and degrees of freedom

For starters, machine learning is not magic. It might seem like magic when Google Photos automatically tags all your family members correctly, in pictures going all the way back to the day of their birth. It might seem like magic when Siri or Alexa gives a perfect response to your request. And the way AlphaZero plays chess is almost human!

But no, machine learning is not magic. I made a detailed argument to that effect in the second edition of my newsletter (subscribe if you haven’t already!).

One way to think of it is that the output of a machine learning model (which could be anything from “does this picture contain a cat?” to “is the speaker speaking in English?”) is the result of a mathematical formula, whose parameters are unknown at the beginning of the exercise.

As the system gets “trained” (of late I’ve avoided using the word “training” in the context of machine learning, preferring “calibration” instead. But anyway…), the hitherto unknown parameters of the formula get adjusted so that the formula’s output matches the given data. Once the system has “seen” enough data, we have a model, which can then be applied to unseen data (I’m simplifying massively here).
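To make that concrete, here is a minimal sketch (in Python, with made-up data) of what such “calibration” looks like for the simplest possible formula, a straight line y = ax + b: the two parameters start off unknown and get nudged, step by step, until the formula’s output matches the data.

```python
import numpy as np

# A toy "model": y = a * x + b, with parameters a and b unknown upfront.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 3.0 * x + 2.0 + rng.normal(0, 1, 50)  # data generated by a hidden a=3, b=2

a, b = 0.0, 0.0  # the parameters start out unknown (here: zero)
lr = 0.01        # learning rate: how hard each nudge is
for _ in range(2000):
    err = (a * x + b) - y
    # gradient of mean squared error with respect to a and b
    a -= lr * 2 * np.mean(err * x)
    b -= lr * 2 * np.mean(err)

print(a, b)  # should land close to the hidden values 3 and 2
```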

The genius in machine learning comes in setting up mathematical formulae in a way that given input-output pairs of data can be used to adjust the parameters of the formulae. The genius in deep learning, for example, which has been all the rage this decade, comes from a thirty-year-old mathematical breakthrough called “backpropagation”. The reason it took until a few years ago for it to become a “thing” has to do with data availability and compute power (check out this terrific piece in the MIT Tech Review about deep learning).

Within machine learning, the degree of complexity of a model can vary significantly. In an ordinary univariate least squares regression, for example, there are only two parameters the system can play with (the slope and intercept of the regression line). Even a simple “shallow” neural network, on the other hand, can have thousands of parameters.
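A quick back-of-envelope illustration (the layer sizes below are arbitrary, chosen only to make the point): counting the weights and biases in a fully connected network shows how fast the parameter count grows compared to regression’s two.

```python
def dense_param_count(layer_sizes):
    # Each layer contributes (inputs + 1 bias) * outputs parameters.
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(dense_param_count([1, 1]))        # univariate regression: slope + intercept = 2
print(dense_param_count([100, 50, 1]))  # a small "shallow" net: 5,101 parameters
```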

Because a regression has so few parameters, the kinds of patterns the system can detect are rather limited (whatever you do, the system can only draw a line. Nothing more!). Thus, regression is applied only when you know that the relationship is simple (and linear), or when you are deliberately force-fitting a linear model.

The upside of simple models such as regression is that because there are so few parameters to be adjusted, you need relatively few data points in order to adjust them to the required degree of accuracy.

As models get more and more complicated, the number of parameters increases, thus increasing the complexity of patterns that can be detected by the system. Close to one extreme, you have systems that see lots of current pictures of you and then identify you in your baby pictures.

Such complicated patterns can be identified because the system parameters have lots of degrees of freedom. The downside, of course, is that because the parameters start off having so much freedom, it takes that much more data to “tie them down”. The reason Google Photos can tag you in your baby pictures is partly down to the quantum of image data that Google has, which does an effective job of tying down the parameters. Google Translate similarly uses large repositories of multi-lingual text in order to “learn languages”.

Like most other things in life, machine learning also involves a tradeoff. It is possible for systems to identify complex patterns, but for that you need to start off with lots of “degrees of freedom”, and then use lots of data to tie the parameters down. If your data is small, you can only afford a small number of parameters, and that limits the complexity of the patterns that can be detected.
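Here is a small demonstration of that tradeoff (synthetic data; the numbers are purely illustrative): a ten-parameter polynomial given only ten noisy points happily memorises the noise, while the two-parameter line generalises.

```python
import numpy as np

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(0, 0.2, 10)  # the truth is a simple line
x_test = np.linspace(0, 1, 100)
y_test = 2 * x_test

for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # degree + 1 free parameters
    mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: test error {mse:.3f}")
# The degree-9 fit passes through every training point
# but does far worse out of sample.
```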

One way around this, of course, is to use your own human intelligence as a pre-processing step in order to set up parameters in a way that they can be effectively tuned by data. Gopi had a nice post recently on “neat learning versus deep learning”, which is relevant in this context.

Finally, there is the issue of spurious correlations. Because machine learning systems are basically mathematical formulae designed to learn patterns from data, spurious correlations in the input dataset can lead to the system learning random things, which can hamper its predictive power.

Data sets, especially ones that have lots of dimensions, can display correlations that appear at random, but if the input dataset shows enough of these correlations, the system will “learn” them as a pattern, and try to use them in predictions. And the more complicated your model gets, the harder it is to know what it is doing, and thus the harder it is to identify these spurious correlations!
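This is easy to see in simulation (a sketch with purely random data): give a dataset many more features than data points, and some feature will correlate strongly with the target by sheer chance.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_features = 20, 1000
X = rng.normal(size=(n_samples, n_features))  # noise features, no signal at all
y = rng.normal(size=n_samples)                # target, unrelated to X

corrs = [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(n_features)]
print(f"largest spurious correlation: {max(corrs):.2f}")  # typically 0.6 or more
```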

And the thing with having too many “free parameters” (lots of degrees of freedom but without enough data to tie down the parameters) is that these free parameters are especially susceptible to learning the spurious correlations – for they have no other job.

After all, machine learning systems are not human!

JEE coaching and high school learning

One reason I’m not as good at machine learning as I could be is that I suck at linear algebra. I totally, completely suck at it. Seven years of using R have meant that I at least no longer get spooked by the very sight of vectors or matrices, and I understand the concept of matrix multiplication (an operator transforming a vector), but I just don’t get linear algebra.

For example, when I see terms such as “singular value decomposition” I almost faint. Multiple repeated attempts at learning the concept have utterly failed. Don’t even get me started on the more complicated stuff – and machine learning is full of it.

My inability to understand linear algebra runs deep, and it’s mainly due to a complete inability to visualise vectors, matrices and matrix operations. As far back as I can remember, I have hated matrices and have tried to run away from them.

For a long time, I had placed the blame for this on IIT Madras, whose mathematics department in its infinite wisdom had decided to get its brilliant Graph Theory expert to teach us matrices. Thinking back, though, I remember going in to MA102 (Vectors, Matrices and Differential Equations) already spooked. The rot had set in even earlier – in school.

The problem with class 11 in my school (a fairly high-profile school which was full of studmax characters) was that most people harboured ambitions of going to IIT, and had consequently enrolled themselves in formal coaching “factories”. As a result, these worthies always came to maths, physics and chemistry classes “ahead” of people like me who didn’t go for such classes (I’d decided to chill for a year after a rather hectic class 10 when I’d been under immense pressure to get my school a “centum”).

Because a large majority of the class already knew what was to be taught, teachers had an incentive to slack. Also, the fact that most students were studmax meant that people preferred to mug on their own rather than display their ignorance in class. And so jai happened.

I remember the class when vectors and matrices were introduced (it was in class 11). While I don’t remember too many details, I do remember that a vocal majority already knew about “dot product” and “cross product”. It was similar a few days later when the vocal majority knew matrix multiplication.

And so these concepts were glossed over, and lacking a grounding in fundamentals, I somehow never “got” the concept.

In my year (2000), CBSE decided to change the format of its maths examination – everyone had to attempt “Part A” (worth 70 marks) and then had a choice between “Part B” (vectors, matrices, etc.) and “Part C” (introductory statistics). Most science students were expected to opt for Part B (Part C had been introduced for the benefit of commerce students studying maths, since they had little to gain from reading about vectors). For me and one other guy from my class, though, it was a rather obvious choice to do Part C.

I remember the invigilator (who was from another school) being positively surprised during my board exam when I mentioned that I was going to attempt Part C instead of Part B. He muttered something to the effect of “isn’t that for commerce students?” but to his credit permitted us to do the paper in whatever way we wanted (I fail to remember why I had to mention to him that I was doing Part C – maybe I needed log tables for it).

Seventeen-odd years down the line, I continue to suck at linear algebra and to be stud at statistics. And it is all down to the way the two subjects were introduced to me in school (statistics for the JEE wasn’t up to the same standard as Part C, so the school teachers did a great job of teaching it).

Maths, machine learning, brute force and elegance

Back when I was at the International Maths Olympiad Training Camp in Mumbai in 1999, the biggest insult one could hurl at a peer was to describe the latter’s solution to a problem as being a “brute force solution”. Brute force solutions, which were often ungainly, laboured and unintuitive were supposed to be the last resort, to be used only if one were thoroughly unable to implement an “elegant solution” to the problem.

Mathematicians love and value elegance. While they might be comfortable with esoteric formulae and the Greek alphabet, they are always on the lookout for solutions that are, at least to the trained eye, intuitive to perceive and understand. Among other things, there is the belief that it is much easier to get an intuitive understanding of an elegant solution.

When all the parts of the solution seem to fit so well into each other, with no loose ends, it is far easier to accept the solution as being correct (even if you don’t understand it fully). Brute force solutions, on the other hand, inevitably leave loose ends and appreciating them can be a fairly massive task, even to trained mathematicians.

In the conventional view, though, non-mathematicians don’t have much fondness for elegance. A solution is a solution, and a problem solved is a problem solved.

With the coming of big data and increased computational power, however, the tables are getting turned. Here, the more mathematical people, who are more likely to appreciate “machine learning” algorithms, recommend “leaving it to the system” – unleashing the brute force of computational power on the problem so that the “best model” can be found, and later implemented.

And in this case, it is the “half-blood mathematicians” like me, who are aware of complex algorithms but are unsure of letting the system take over stuff end-to-end, who bat for elegance – to look at data, massage it, analyse it and then find that one simple method or transformation that can throw immense light on the problem, effectively solving it!

The world moves in strange ways.

Bayesian recognition in baby similarity

When people come to see small babies, it’s almost like they’re obliged to offer their opinions on who the child looks like. Most of the time it’s an immediate ancestor – either a parent or a grandparent. Sometimes it could be a cousin, aunt or uncle as well. Thankfully, it’s uncommon to compare a baby’s looks to people it doesn’t share genes with.

So as people have come up and offered their opinions on who our daughter looks like (I’m top seed, I must mention), I’ve been trying to analyse how they come up with their predictions. And as I observe the connections between people making the observations, and who they mention, I realise that this too follows some kind of Bayesian Recognition.

Basically, different people who come to see the baby have different amounts of information on how each of the baby’s ancestors looked. A recent friend of mine, for example, will only know how my wife and I look. An older friend might have some idea of how my parents looked. A relative might have a better judgment of how one of my parents looked than of how I looked.

So based on their experiences in recognising different people in and around the baby’s immediate ancestry, they effectively start with a prior distribution of who the baby looks like. And then when they see the baby, they update their priors, and then mention the person with the highest posterior probability of matching the baby’s face and features.
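A toy version of that inference might look like this (all the names, priors and likelihoods below are invented for illustration): each visitor carries their own priors and likelihoods, so the name with the highest posterior differs from visitor to visitor.

```python
# Invented numbers: this visitor's prior over who the baby takes after.
priors = {"father": 0.4, "mother": 0.4, "grandfather": 0.1, "grandmother": 0.1}

# P(baby's observed features | baby takes after this person), as judged
# by this visitor. Someone who never knew the grandparents effectively
# has uninformative values for them.
likelihoods = {"father": 0.7, "mother": 0.3, "grandfather": 0.5, "grandmother": 0.5}

unnormalised = {k: priors[k] * likelihoods[k] for k in priors}
total = sum(unnormalised.values())
posterior = {k: v / total for k, v in unnormalised.items()}

print(max(posterior, key=posterior.get))  # the name this visitor blurts out
```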

Given that the posterior probability is a function of the prior probability, it is no surprise that different people disagree on who the baby looks like. After all, each person’s private knowledge of the idiosyncratic faces in the baby’s ancestry, and thus their priors, will be different!

Unrelated, but staying on Bayesian reasoning, I recently read this fairly stud piece in Aeon on why stereotyping is not necessarily a bad thing. The article argues that in the absence of further information, stereotypes help us form a good first prior, and that stereotypes only become a problem if we fail to update our priors with any additional information we get.

Half life of pain

Last evening, the obstetrician came over to check on the wife, following the afternoon’s Caesarean section operation. Upon being asked how she was, the wife replied that she was feeling good, except that she was still in a lot of pain. “In how many days can I expect this pain to subside?”, she asked.

The doctor replied that it was a really hard question to answer, since there was no definite time frame. “All I can tell you is that the pain will go down gradually, so it’s hard to say whether it lasts 5 days or 10 days. Think of this – if you hurt your foot and there’s a blood clot, isn’t the recovery gradual? It’s the same in this case”.

While she was saying this, I was reminded of exponential decay, and started wondering whether post-operative pain (irrespective of the kind of surgery) follows exponential decay, decreasing by a certain percentage each day; and whether, when someone says the pain “disappears” after a certain number of days, it simply means that the pain goes below a particular threshold in that time period – a threshold that can vary from person to person.

So in that sense, rather than simply telling my wife that the pain would “decrease gradually”, the obstetrician could have been more helpful by saying “the pain will decrease gradually, and will reduce to half in about N days”. Based on the value of N, my wife could then determine, given her own threshold, when her pain would “go”.
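To spell out that arithmetic (under my assumed exponential-decay model; N, p_0 and \theta are my illustrative symbols, not the doctor’s): if the pain starts at p_0 and halves every N days, then

p(t) = p_0 \cdot 2^{-t/N}

and the pain goes below a personal threshold \theta exactly when

t > N \log_2(p_0/\theta)

So with, say, N = 3 days and a threshold of a tenth of the initial pain, the pain would “go” in about 3 \log_2 10 \approx 10 days.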

Nevertheless, the doctor’s logic (that pain never “disappears discretely”) had me impressed, and I’ve mentioned before on this blog how impressed I get with doctors who are logically aware.

Oh, and I must mention that the same obstetrician who operated on my wife yesterday had impressed me with her logical reasoning a week earlier. My then unborn daughter wasn’t moving much that day, which was why we were in hospital. My wife was given steroidal injections, and the baby started moving an hour later.

So when we mentioned to the obstetrician that “after you gave the steroids the baby started moving”, she curtly replied “the baby moving has nothing to do with the steroidal injections. The baby moves because the baby moves. It is just a coincidence that it happened after I gave the steroids”.

Bayes and serial correlation in disagreements

People who have been in a long-term relationship are likely to recognise that fights between a couple are not memoryless – the likelihood of fighting today is not independent of whether you fought yesterday.

In fact, if you had fought in a particular time period, it increases the likelihood that you’ll fight in the next time period. As a consequence, what you are likely to find is that there are times when you go days, or weeks, or even months, together in a perennial state of disagreement, while you’ll also have long periods of peace and bliss.

While this serial correlation can be disconcerting at times, and make you wonder whether you are in a relationship with the right person, it is not hard to understand why this happens. Once again, our old friend Reverend Thomas Bayes comes to the rescue here.

This is an extremely simplified model, but it will serve the purpose of this post. Each half of a couple believes that the other (better?) half can exist in one of two states – “nice” and “jerk”. In fact, it’s unlikely anyone will exist completely in one of these states – they’re likely to exist in a superposition of the two.

So let’s say that the probability of your partner being a jerk is P(J), which puts the probability of him/her being “nice” at P(N) = 1 - P(J). Now, when he/she does or says something (let’s call this event E), you implicitly do a Bayesian updation of these probabilities.

For every word or action of your partner, you can estimate its probability under each of the two hypotheses – partner being a jerk, and partner being nice. After every action E by the partner, you update your priors about them with this new information.

So the new probability of him being a jerk (given event E) will be given by
P(J|E) = \frac{P(J) \cdot P(E|J)}{P(J) \cdot P(E|J) + P(N) \cdot P(E|N)} (the standard Bayesian formula).

Now notice that the new probability of the partner being a jerk is dependent upon the prior probability. So when P(J) is already high, it is highly likely that whatever action the partner does will not move the needle significantly. And the longer P(J) stays high, the higher the probability that you’ll lapse into a fight again. Hence the prolonged periods of fighting, and serial correlation.

This equation also explains why attempts to resolve a fight quickly can backfire. When you are fighting, the normal reaction to resolve it is by committing actions that indicate that you are actually nice. The problem is that the equation above has both P(E|N) and P(E|J) in it.

So, in order to resolve a fight, you should commit not only actions that you would do when you are nice, but also actions that you would NOT do if you were a jerk. In other words, the easiest way to pull down P(J) in the above equation is to commit an E with high P(E|N) and low P(E|J).
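A small sketch of these updates (the likelihood numbers are made up) shows both effects: a high prior barely moves under an ambiguous gesture, but drops sharply for an act a jerk would almost never perform.

```python
def update_pj(pj, p_e_given_j, p_e_given_n):
    # One Bayesian update of P(jerk) after observing event E,
    # using the formula above with P(N) = 1 - P(J).
    return pj * p_e_given_j / (pj * p_e_given_j + (1 - pj) * p_e_given_n)

# Mid-fight, P(J) is high. A gesture a jerk might also make barely helps...
print(round(update_pj(0.9, p_e_given_j=0.5, p_e_given_n=0.8), 2))   # ~0.85

# ...but a gesture with high P(E|N) and low P(E|J) pulls P(J) down sharply.
print(round(update_pj(0.9, p_e_given_j=0.05, p_e_given_n=0.8), 2))  # ~0.36
```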

What complicates things is that if you use one such weapon too many times, the partner will begin to see through you, and up her P(E|J) for this event. So you need to keep coming up with new tricks to defuse fights.

In short, that serial correlation exists in relationship fights is a given, and there is little you can do to prevent it. So if you go through a long period of continuous disagreement with your partner, keep in mind that such things are par for the course, and don’t do something drastic like breaking up.

Horses, Zebras and Bayesian reasoning

David Henderson at Econlog quotes a doctor on a rather interesting and important point, regarding Bayesian priors. He writes:

 Later, when I went to see his partner, my regular doctor, to discuss something else, I mentioned that incident. He smiled and said that one of the most important lessons he learned from one of his teachers in medical school was:

When you hear hooves, think horses, not zebras.

This was after Henderson had had some symptoms that are correlated with a heart attack and panicked and called his doctor, got treated for gas trouble, and was absolutely fine after that.

Our problem is that when we have symptoms that are correlated with something bad, we immediately assume that it’s the bad thing that has happened, and panic. In the process, we fail to consider alternative explanations, and to do a Bayesian analysis.

Let me illustrate with a personal example. Back when I was a schoolboy, and I wouldn’t return home from school at the right time, my mother would panic. This was the time before cellphones, remember, and she would just assume that “the worst” had happened and that I was in trouble somewhere. Calls would go to my father’s office, and he would ask her to wait, though to my credit I was never so late that they had to take any further action.

Now, coming home late from school can happen due to a variety of reasons. Let us eliminate reasons such as wanting to play basketball for a while before returning – since such activities were “usual” and had been budgeted for. So let’s assume that there are two possible reasons I’m late. The first is that I had gotten into trouble – I had either been knocked down on my way home or been kidnapped. The second is that the BTS (Bangalore Transport Service, as it was then called) schedule had gone completely awry, thanks to which I had missed my usual set of buses and was thus delayed. Note that me not turning up at home until a certain point of time was a symptom of both of these.

Having noticed such a symptom, my mother would automatically jump to the “worst case” conclusion (that I had been knocked down or kidnapped), and panic. But I’m not sure that was the rational reaction. What she should have done was a Bayesian analysis, and used that to guide her panic.

Let A be the event that I’d been knocked over or kidnapped, and B be the event that the bus schedule had gone awry. Let L(t) be the event that I haven’t gotten home by time t, and suppose such an event has been “observed”. The question is: with L(t) having been observed, what are the odds of A and B having happened? Bayes’ Theorem gives us an answer. The equation is rather simple:

P(A|L(t)) = \frac{P(A) \cdot P(L(t)|A)}{P(A) \cdot P(L(t)|A) + P(B) \cdot P(L(t)|B)}

P(B|L(t)) is just one minus the above quantity (we assume that there is nothing else that can cause L(t)).

So now let us give values. I’m too lazy to find the data now, but let’s say we find from the national crime data that the odds of a fifteen-year-old boy being in an accident or kidnapped on a given day are one in a million. And if that happens, then L(t) obviously gets observed. So we have

P(A) = \frac{1}{1000000}
P(L(t) | A) = 1

The BTS was notorious back in the day for its delayed and messed-up schedules. So let us assume that P(B) is \frac{1}{100}. Now, P(L(t)|B) is tricky, and it is the reason the (t) qualifier has been added to L. The larger t is, the smaller the value of P(L(t)|B). If there is a bus schedule breakdown, there is probably a 50% probability that I’m not home an hour after “usual”. But there is only a 10% probability that I’m not home two hours after “usual” because of a bus breakdown. So

P(L(1)|B) = 0.5
P(L(2)|B) = 0.1

Now let’s plug in, and based on how delayed I was, find the odds that I was knocked down/kidnapped. If I were late by an hour,
P(A|L(1)) = \frac{\frac{1}{1000000} \cdot 1}{\frac{1}{1000000} \cdot 1 + \frac{1}{100} \cdot 0.5}
or P(A|L(1)) = 0.00019996. In other words, if I hadn’t gotten home an hour after my usual time, the odds that I had been knocked down or kidnapped were just one in five thousand!

What if I didn’t come home two hours after my normal time? Again we can plug into the formula, and here we find that P(A|L(2)) = 0.000999 or one in a thousand! So notice that the later I am, the higher the odds that I’m in trouble. Yet, the numbers (admittedly based on the handwaving assumptions above) are small enough for us to not worry!
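For what it’s worth, both numbers drop straight out of the formula above (a quick sketch, using the hand-waved inputs from this post):

```python
def p_trouble(p_a, p_lt_given_a, p_b, p_lt_given_b):
    # P(A | L(t)): knocked down/kidnapped, given I'm still not home at time t.
    return (p_a * p_lt_given_a) / (p_a * p_lt_given_a + p_b * p_lt_given_b)

print(p_trouble(1e-6, 1.0, 0.01, 0.5))  # one hour late:  ~0.0002   (one in five thousand)
print(p_trouble(1e-6, 1.0, 0.01, 0.1))  # two hours late: ~0.000999 (one in a thousand)
```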

Bayesian reasoning has its implications elsewhere, too. There is the medical case, as Henderson’s blogpost illustrates. Then we can use this to determine whether a wrong act was due to stupidity or due to malice. And so forth.

But what Henderson’s doctor told him is truly an immortal line:

When you hear hooves, think horses, not zebras.