Bayesian Reasoning and Indian Philosophy

I’m currently reading a book called How the World Thinks: A global history of philosophy by Julian Baggini. I must admit I bought this by mistake – I was at a bookshop where I saw this book and went to the Amazon website to check reviews. And by mistake I ended up hitting buy. And before I got around to returning it, I started reading and liking it, so I decided to keep it.

In any case, this book is a nice comparative history of world philosophies, with considerable focus on Indian, Chinese, Japanese and Islamic philosophies. The author himself is trained in European/Western philosophy, but he keeps an open mind and so far it’s been an engaging read.

Rather than approaching the topic in chronological order, like some historians might have been tempted to do, this book approaches it by concept, comparing how different philosophies treat the same concept. And the description of Indian philosophy in the “Logic” chapter caught my eye, in the sense that it reminded me of Bayesian logic, and a piece I’d written a few years back.

Talking about Hindu philosophy and logic, Baggini writes:

For instance, the Veda affirms that when the appropriate sacrifice for the sake of a son is performed, a son will be produced. But it is often observed that a son is not produced, even though the sacrifice has been performed. This would seem to be pretty conclusive proof that the sacrifices don’t work and so the Veda is flawed. Not, however, if you start from the assumption that the Veda cannot be flawed.

In other words, Hindu philosophy starts with the Bayesian prior that the Veda cannot be flawed. Consequently, however strong the empirical evidence that the Vedas are flawed, the belief in the Vedas can never change! On the other hand, if the prior probability of the Vedas being flawed were positive, even if infinitesimal, then the accumulation of evidence such as the above (sacrifices that are supposed to produce sons failing to do so) would over time push the probability of the Vedas being flawed upwards, eventually tending to 1.
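
A minimal sketch of this updating process, with made-up likelihood numbers, shows why a prior of exactly zero is immovable while even a tiny positive prior eventually gives way to the evidence:

```python
# Repeatedly update the belief that "the Veda is flawed" after observing
# sacrifices that failed to produce a son. Likelihoods are made up for illustration.
def update(prior_flawed, p_fail_if_flawed=0.9, p_fail_if_not=0.5):
    """One Bayesian update after observing a failed sacrifice."""
    numerator = prior_flawed * p_fail_if_flawed
    denominator = numerator + (1 - prior_flawed) * p_fail_if_not
    return numerator / denominator

for prior in [0.0, 1e-9]:
    p = prior
    for _ in range(100):   # a hundred failed sacrifices observed
        p = update(p)
    print(f"prior={prior}: posterior after 100 failures = {p:.6f}")

# prior=0.0  -> posterior stays exactly 0, however much evidence accumulates
# prior=1e-9 -> posterior climbs to very nearly 1
```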

In 2015, I had written in Mint about how Bayesian logic can be used to explain online flame wars. There again, I had written about how, when people start with extreme opinions (probabilities equal to 0 or 1), even the strongest contrary evidence fails to get them to change their opinions. And hence in online flame wars you have people simply talking past each other, because neither is willing to update their opinion in the face of evidence.

Coming back to Hindu philosophy, this prior belief that the Vedas cannot be flawed reminds me of the numerous futile arguments with some of my relatives who are of a rather religious persuasion. In each case I presented to them what seemed like strong proof that some of their assumptions of religion are flawed. In each case, irrespective of the strength of my evidence, they refused to heed my argument. Now, looking at the prior of a religious Hindu – that the likelihood of the Vedas being flawed is 0 (not infinitesimal, but 0), it is clear why my arguments fell on deaf ears.

In any case, Baggini goes on to say:

By this logic, if ‘a son is sure to be produced as a result of performing the sacrifice’ but a son is not produced, it can only follow that the sacrifice was not performed correctly, however much it seems that it was performed properly. By such argument, the Nyāya Sūtra can safely conclude, ‘Therefore there is no untruth in the Veda.’

Hypothesis Testing in Monte Carlo

I find it incredible, and not in a good way, that I took fourteen years to make the connection between two concepts I learnt barely a year apart.

In August-September 2003, I was auditing an advanced (graduate) course on Advanced Algorithms, where we learnt about randomised algorithms (I soon stopped auditing since the maths got heavy). And one important class of randomised algorithms is what is known as “Monte Carlo Algorithms”. Not to be confused with Monte Carlo Simulations, these are randomised algorithms that give a one-way result. So, using the most prominent example of such an algorithm, you can ask “is this number prime?” and the answer to that can be either “maybe” or “no”.

The randomised algorithm can never conclusively answer “yes” to the primality question. If the algorithm can find a prime factor of the number, it answers “no” (this is conclusive). Otherwise it returns “maybe”. So the way you “conclude” that a number is prime is by running the test a large number of times. Each independent “maybe” reduces the probability that the true answer is a “no”, and when the probability of “no” is low enough, you “think” it’s a “yes”. You might like this old post of mine regarding Monte Carlo algorithms in the context of romantic relationships.
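
Here is a sketch of such a one-way test – a simplified Fermat primality check (real implementations use the Miller–Rabin test, which also handles corner cases like Carmichael numbers):

```python
import random

def fermat_test(n, rounds=20):
    """Monte Carlo primality check: returns "no" (definitely composite)
    or "maybe prime". Simplified for illustration; production code would
    use Miller-Rabin instead."""
    if n < 4:
        return "maybe prime" if n in (2, 3) else "no"
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        if pow(a, n - 1, n) != 1:   # a witness: n is certainly composite
            return "no"
    # Every round said "maybe"; each independent round shrinks the chance
    # that a composite slipped through, but we never get a hard "yes".
    return "maybe prime"

print(fermat_test(104729))   # a prime: always comes back "maybe prime"
print(fermat_test(104730))   # composite: almost surely "no"
```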

Less than a year later, in July 2004, as part of a basic course in statistics, I learnt about hypothesis testing. Now (I’m kicking myself for failing to see the similarity then), the main principle of hypothesis testing is that you can never “accept a hypothesis”. You either reject a hypothesis or “fail to reject” it.  And if you fail to reject a hypothesis with a certain high probability (basically with more data, which implies more independent evaluations that don’t say “reject”), you will start thinking about “accept”.

Basically hypothesis testing is a one-sided test, where you are trying to reject a hypothesis. And not being able to reject a hypothesis doesn’t mean we necessarily accept it – there is still the chance of going wrong if we were to accept it (this is where we get into messy territory such as p-values). And this is exactly like Monte Carlo algorithms – one-sided algorithms where we can only conclusively take a decision one way.
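
A small simulated example (a one-sided coin-bias test, with numbers picked purely for illustration) makes the “reject / fail to reject” asymmetry concrete:

```python
import random

# H0: the coin is fair. One-sided test: is it biased towards heads?
def one_sided_p_value(heads, tosses, simulations=20_000):
    """Estimate P(at least `heads` heads in `tosses` fair tosses) by simulation."""
    extreme = 0
    for _ in range(simulations):
        h = sum(random.random() < 0.5 for _ in range(tosses))
        if h >= heads:
            extreme += 1
    return extreme / simulations

p = one_sided_p_value(heads=58, tosses=100)
if p < 0.05:
    print(f"p = {p:.3f}: reject H0 (the coin looks biased)")
else:
    # Note the language: we "fail to reject" H0; we never "accept" it.
    print(f"p = {p:.3f}: fail to reject H0")
```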

So I was thinking of these concepts when I came across this headline in ESPNCricinfo yesterday that said “Rahul Johri not found guilty” (not linking since Cricinfo has since changed the headline). The choice, or rather ordering, of words was interesting. “Not found guilty”, it said, rather than the usual “found not guilty”.

This is again a concept of one-sided testing. An investigation can either find someone guilty or fail to do so, and the headline in this case suggested that the latter had happened. It soon became apparent that this was a deliberate choice – later it emerged that the decision to clear Rahul Johri of sexual harassment charges was a contentious one.

In most cases, when someone is “found not guilty” following an investigation, it usually suggests that the evidence on hand was enough to say that the chance of the person being guilty was rather low. The phrase “not found guilty”, on the other hand, says that one test failed to reject the hypothesis, but it didn’t have sufficient confidence to clear the person of guilt.

So due credit to the Cricinfo copywriters, and due debit to the product managers for later changing the headline rather than putting out a fresh follow-up piece.

PS: The discussion following my tweet on the topic threw up one very interesting insight – that Scotland has had a “not proven” verdict for precisely such cases (you can trust DD for coming up with such gems).

Dimensional analysis in stochastic finance

Yesterday I was reading through Ole Peters’s lecture notes on ergodicity, a topic that I got interested in thanks to my extensive use of Utility Theory in my work nowadays. And I had a revelation – that in standard stochastic finance, mean returns and standard deviation of returns don’t have the same dimensions. Instead, it’s mean returns and the variance of returns that have the same dimensions.

While this might sound counterintuitive, it is not hard to see if you think about it analytically. We will start with what is possibly the most basic equation in stochastic finance, which is the lognormal random walk model of stock prices.

dS = \mu S dt + \sigma S dW

This can be rewritten as

\frac{dS}{S} = \mu dt + \sigma dW

Now, let us look at dimensions. The LHS divides change in stock price by stock price, and is hence dimensionless. So the RHS needs to be dimensionless as well if the equation is to make sense.

It is easy to see that the first term on the RHS is dimensionless – \mu, the average returns or the drift, is defined as “returns per unit time”. So a stock that returns, on average, 10% in a year returns 20% in two years. Returns thus have dimensions t^{-1}, and multiplying by dt, which has the dimension of time, renders the term dimensionless.

That leaves us with the last term. dW is the Wiener Process, and is defined such that dW^2 = dt. This implies that dW has the dimensions \sqrt{t}. This means that the equation is meaningful if and only if \sigma has dimensions t^{-\frac{1}{2}}, which is the same as saying that \sigma^2 has dimensions \frac{1}{t}, which is the same as the dimensions of the mean returns.

It is not hard to convince yourself that this makes intuitive sense as well. The basic assumption of a random walk is that the variance grows linearly with time (another way of seeing this is that when you add two uncorrelated random variables, their variances add up to give the variance of the sum). For the variance of returns over a period to be dimensionless, \sigma^2, the variance per unit time, must have the units of inverse time – the same as the mean returns.
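
A quick simulation of the lognormal random walk (a sketch with arbitrarily chosen parameter values) confirms the scaling – the variance of cumulative returns grows as t, and the standard deviation as \sqrt{t}:

```python
import numpy as np

np.random.seed(0)
mu, sigma, dt, steps, paths = 0.10, 0.20, 1 / 252, 252, 20000  # arbitrary values

# Simulate dS/S = mu dt + sigma dW via the log-return increments
dW = np.sqrt(dt) * np.random.randn(paths, steps)
log_returns = (mu - 0.5 * sigma**2) * dt + sigma * dW
cum_log_returns = np.cumsum(log_returns, axis=1)

for t_steps in (63, 126, 252):   # roughly a quarter, half a year, a year
    t = t_steps * dt
    var = cum_log_returns[:, t_steps - 1].var()
    print(f"t = {t:.2f} years: variance = {var:.4f}, sigma^2 * t = {sigma**2 * t:.4f}")
# The simulated variance tracks sigma^2 * t, so sigma^2 indeed carries
# dimensions of inverse time.
```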

Finally, speaking of dimensional analysis and Ole Peters, check out his proof of the Pythagoras Theorem using dimensional analysis.

Isn’t it beautiful?

PS: Speaking of dimensional analysis, check out my recent post on stocks and flows and financial ratios.


Machine learning and degrees of freedom

For starters, machine learning is not magic. It might appear like magic when you see Google Photos automatically tagging all your family members correctly, even in pictures going back to the day they were born. It might appear so when Siri or Alexa give a perfect response to your request. And the way AlphaZero plays chess is almost human!

But no, machine learning is not magic. I’d made a detailed argument about that in the second edition of my newsletter (subscribe if you haven’t already!).

One way to think of it is that the output of a machine learning model (which could be anything from “does this picture contain a cat?” to “is the speaker speaking in English?”) is the result of a mathematical formula, whose parameters are unknown at the beginning of the exercise.

As the system gets “trained” (of late I’ve avoided using the word “training” in the context of machine learning, preferring to use “calibration” instead. But anyway…), the hitherto unknown parameters of the formula get adjusted in a manner such that the formula’s output matches the given data. Once the system has “seen” enough data, we have a model, which can then be applied on unknown data (I’m completely simplifying it here).

The genius in machine learning comes in setting up mathematical formulae in a way that given input-output pairs of data can be used to adjust the parameters of the formulae. The genius in deep learning, which has been the rage this decade, for example, comes from a 30-year-old mathematical breakthrough called “back propagation”. The reason it took until a few years back for it to become a “thing” has to do with data availability and compute power (check this terrific piece in the MIT Tech Review about deep learning).

Within machine learning, the degree of complexity of a model can vary significantly. In an ordinary univariate least squares regression, for example, there are only two parameters the system can play with (slope and intercept of the regression line). Even a simple “shallow” neural network, on the other hand, has thousands of parameters.
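
A back-of-the-envelope parameter count makes the gap concrete (the layer sizes below are arbitrary, purely for illustration):

```python
# Univariate least squares regression: y = a*x + b
ols_params = 2   # slope and intercept

# A small "shallow" network: 100 inputs -> 64 hidden units -> 1 output
inputs, hidden, outputs = 100, 64, 1
nn_params = (inputs * hidden + hidden) + (hidden * outputs + outputs)  # weights + biases
print(ols_params, nn_params)   # 2 versus 6529
```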

Because a regression has so few parameters, the kind of patterns that the system can detect is rather limited (whatever you do, the system can only draw a line. Nothing more!). Thus, regression is applied only when you know that the relationship that exists is simple (and linear), or when you are trying to force-fit a linear model.

The upside of simple models such as regression is that because there are so few parameters to be adjusted, you need relatively few data points in order to adjust them to the required degree of accuracy.

As models get more and more complicated, the number of parameters increases, thus increasing the complexity of patterns that can be detected by the system. Close to one extreme, you have systems that see lots of current pictures of you and then identify you in your baby pictures.

Such complicated patterns can be identified because the system parameters have lots of degrees of freedom. The downside, of course, is that because the parameters start off having so much freedom, it takes that much more data to “tie them down”. The reason Google Photos can tag you in your baby pictures is partly down to the quantum of image data that Google has, which does an effective job of tying down the parameters. Google Translate similarly uses large repositories of multi-lingual text in order to “learn languages”.

Like most other things in life, machine learning also involves a tradeoff. It is possible for systems to identify complex patterns, but for that you need to start off with lots of “degrees of freedom”, and then use lots of data to tie down the variables. If your data is small, then you can only afford a small number of parameters, and that limits the complexity of patterns that can be detected.
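
A toy illustration of the tradeoff, using polynomial fits on made-up data: with only a handful of points, a ten-parameter model happily fits the noise and typically does worse out of sample than the humble two-parameter line:

```python
import numpy as np

rng = np.random.default_rng(42)
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + 1 + rng.normal(0, 0.2, 10)   # truly linear, plus noise
x_test = np.linspace(0, 1, 100)
y_test = 2 * x_test + 1

for degree in (1, 9):   # 2 parameters versus 10 parameters
    coeffs = np.polyfit(x_train, y_train, degree)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: out-of-sample MSE = {test_mse:.3f}")
# The flexible fit chases the training noise; with so little data there is
# nothing to "tie down" its extra degrees of freedom.
```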

One way around this, of course, is to use your own human intelligence as a pre-processing step in order to set up parameters in a way that they can be effectively tuned by data. Gopi had a nice post recently on “neat learning versus deep learning”, which is relevant in this context.

Finally, there is the issue of spurious correlations. Because machine learning systems are basically mathematical formulae designed to learn patterns from data, spurious correlations in the input dataset can lead to the system learning random things, which can hamper its predictive power.

Data sets, especially ones that have lots of dimensions, can display correlations that appear at random, but if the input dataset shows enough of these correlations, the system will “learn” them as a pattern, and try to use them in predictions. And the more complicated your model gets, the harder it is to know what it is doing, and thus the harder it is to identify these spurious correlations!
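
A small sketch shows how easily such correlations appear in pure noise once there are enough dimensions to search over:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 100, 1000
X = rng.normal(size=(n_samples, n_features))   # pure noise "features"
y = rng.normal(size=n_samples)                 # pure noise "target"

# Correlation of each feature with the target
correlations = (X - X.mean(0)).T @ (y - y.mean()) / (n_samples * X.std(0) * y.std())
print(f"max |correlation| among {n_features} noise features: {np.abs(correlations).max():.2f}")
# Typically around 0.3-0.4 -- a "pattern" that a sufficiently flexible model
# will gladly learn, even though there is nothing there.
```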

And the thing with having too many “free parameters” (lots of degrees of freedom but without enough data to tie down the parameters) is that these free parameters are especially susceptible to learning the spurious correlations – for they have no other job.

After all, machine learning systems are not human!

JEE coaching and high school learning

One reason I’m not as good at machine learning as I can possibly be is because I suck at linear algebra. I totally completely suck at it. Seven years of usage of R has meant that at least I no longer get spooked out by the very sight of vectors or matrices, and I understand the concept of matrix multiplication (an operator transforming a vector), but I just don’t get linear algebra.

For example, when I see terms such as “singular value decomposition” I almost faint. Multiple repeated attempts at learning the concept have utterly failed. Don’t even get me started on the more complicated stuff – and machine learning is full of them.

My inability to understand linear algebra runs deep, and it’s mainly due to a complete inability to imagine vectors and matrices and matrix operations. As far back as I remember, I have hated matrices and have tried to run away from them.

For a long time, I had placed the blame for this on IIT Madras, whose mathematics department in its infinite wisdom had decided to get its brilliant Graph Theory expert to teach us matrices. Thinking back, though, I remember going in to MA102 (Vectors, Matrices and Differential Equations) already spooked. The rot had set in even earlier – in school.

The problem with class 11 in my school (a fairly high-profile school which was full of studmax characters) was that most people harboured ambitions of going to IIT, and had consequently enrolled themselves in formal coaching “factories”. As a result, these worthies always came to maths, physics and chemistry classes “ahead” of people like me who didn’t go for such classes (I’d decided to chill for a year after a rather hectic class 10 when I’d been under immense pressure to get my school a “centum”).

Because a large majority of the class already knew what was to be taught, teachers had an incentive to slack. Also, the fact that most students were studmax meant that people preferred to mug on their own rather than display their ignorance in class. And so jai happened.

I remember the class when vectors and matrices were introduced (it was in class 11). While I don’t remember too many details, I do remember that a vocal majority already knew about “dot product” and “cross product”. It was similar a few days later when the vocal majority knew matrix multiplication.

And so these concepts were glossed over, and lacking a grounding in fundamentals, I somehow never “got” the concept.

In my year (2000), CBSE decided to change the format of its maths examination – everyone had to attempt “Part A” (worth 70 marks) and then had a choice between “Part B” (vectors, matrices, etc.) and “Part C” (introductory statistics). Most science students were expected to opt for Part B (Part C had been introduced for the benefit of commerce students studying maths, since they had little to gain from reading about vectors). For me and one other guy from my class, though, it was a rather obvious choice to do Part C.

I remember the invigilator (who was from another school) being positively surprised during my board exam when I mentioned that I was going to attempt Part C instead of Part B. He muttered something to the extent of “isn’t that for commerce students?” but to his credit permitted us to do the paper in whatever way we wanted (I fail to remember why I had to mention to him I was doing Part C – maybe I needed log tables to do that).

Seventeen-odd years down the line, I continue to suck at linear algebra and be stud at statistics. And it is all down to the way the two subjects were introduced to me in school (the statistics required for JEE wasn’t up to the Part C standard, so nobody came to class “ahead”, and the school teachers did a great job of teaching it).

Maths, machine learning, brute force and elegance

Back when I was at the International Maths Olympiad Training Camp in Mumbai in 1999, the biggest insult one could hurl at a peer was to describe the latter’s solution to a problem as being a “brute force solution”. Brute force solutions, which were often ungainly, laboured and unintuitive were supposed to be the last resort, to be used only if one were thoroughly unable to implement an “elegant solution” to the problem.

Mathematicians love and value elegance. While they might be comfortable with esoteric formulae and the Greek alphabet, they are always on the lookout for solutions that are, at least to the trained eye, intuitive to perceive and understand. Among other things, there is the belief that it is much easier to get an intuitive understanding of an elegant solution.

When all the parts of the solution seem to fit so well into each other, with no loose ends, it is far easier to accept the solution as being correct (even if you don’t understand it fully). Brute force solutions, on the other hand, inevitably leave loose ends and appreciating them can be a fairly massive task, even to trained mathematicians.

In the conventional view, though, non-mathematicians don’t have much fondness for elegance. A solution is a solution, and a problem solved is a problem solved.

With the coming of big data and increased computational power, however, the tables are getting turned. In this case, the more mathematical people, who are more likely to appreciate “machine learning” algorithms, recommend “leaving it to the system” – unleashing the brute force of computational power at the problem so that the “best model” can be found, and later implemented.

And in this case, it is the “half-blood mathematicians” like me, who are aware of complex algorithms but are unsure of letting the system take over stuff end-to-end, who bat for elegance – to look at data, massage it, analyse it and then find that one simple method or transformation that can throw immense light on the problem, effectively solving it!

The world moves in strange ways.

Bayesian recognition in baby similarity

When people come to see small babies, it’s almost like they’re obliged to offer their opinions on who the child looks like. Most of the time it’s an immediate ancestor – either a parent or grandparent. Sometimes it could be a cousin or aunt or uncle as well. Thankfully it’s uncommon to compare babies’ looks to people they don’t share genes with.

So as people have come up and offered their opinions on who our daughter looks like (I’m top seed, I must mention), I’ve been trying to analyse how they come up with their predictions. And as I observe the connections between people making the observations, and who they mention, I realise that this too follows some kind of Bayesian Recognition.

Basically, different people who come to see the baby have different amounts of information on what each of the baby’s ancestors looks (or looked) like. A recent friend of mine, for example, will only know what my wife and I look like. An older friend might have some idea of what my parents look like. A relative might have a better sense of what one of my parents looks like than of what I look like.

So based on their experiences in recognising different people in and around the baby’s immediate ancestry, they effectively start with a prior distribution of who the baby looks like. And then when they see the baby, they update their priors, and then mention the person with the highest posterior probability of matching the baby’s face and features.
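
A stylised sketch of that updating process (all the numbers below are invented):

```python
# Each visitor starts with a prior over candidate relatives, based on whose
# faces they know well, and updates it with how closely the baby's features
# seem to match each candidate. All numbers are invented.
prior = {"father": 0.4, "mother": 0.4, "grandmother": 0.2}        # one visitor's prior
likelihood = {"father": 0.7, "mother": 0.5, "grandmother": 0.3}   # P(baby's features | person)

unnormalised = {k: prior[k] * likelihood[k] for k in prior}
total = sum(unnormalised.values())
posterior = {k: round(v / total, 3) for k, v in unnormalised.items()}

print(max(posterior, key=posterior.get), posterior)
# A visitor with a different prior -- say one who knows only the grandmother's
# face well -- can name a different person for the very same baby.
```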

Given that the posterior probability is a function of the prior probability, there is no surprise that different people will disagree on who the baby looks like. After all, each person’s private knowledge of the faces in the baby’s ancestry, and thus their priors, will be different!

Unrelated, but staying on Bayesian reasoning, I recently read this fairly stud piece in Aeon on why stereotyping is not necessarily a bad thing. The article argues that in the absence of further information, stereotypes help us form a good first prior, and that stereotypes only become a problem if we fail to update our priors with any additional information we get.

Half life of pain

Last evening, the obstetrician came over to check on the wife, following the afternoon’s Caesarean section operation. Upon being asked how she was, the wife replied that she’s feeling good, except that she was still in a lot of pain. “In how many days can I expect this pain to subside?”, she asked.

The doctor replied that it was a really hard question to answer, since there was no definite time frame. “All I can tell you is that the pain will go down gradually, so it’s hard to say whether it lasts 5 days or 10 days. Think of this – if you hurt your foot and there’s a blood clot, isn’t the recovery gradual? It’s the same in this case”.

While she was saying this, I was reminded of exponential decay, and started wondering whether post-operative pain (irrespective of the kind of surgery) follows exponential decay, decreasing by a certain percentage each day. When someone says the pain “disappears” after a certain number of days, perhaps it just means that the pain goes below a particular threshold in that time period – and this threshold can vary from person to person.

So in that sense, rather than simply telling my wife that the pain will “decrease gradually”, the obstetrician could have been more helpful by saying “the pain will decrease gradually, and will reduce to half in about N days”. Then, given N and her own threshold, my wife could have worked out when her pain would “go”.
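
Under this (assumed) exponential-decay model, the arithmetic is straightforward; here is a sketch with invented numbers:

```python
import math

def days_until_pain_subsides(initial_pain, threshold, half_life_days):
    """Days until exponentially decaying pain drops below a personal threshold,
    assuming pain(t) = initial_pain * 0.5 ** (t / half_life_days)."""
    return half_life_days * math.log2(initial_pain / threshold)

# Invented numbers: pain starts at 10 on some scale, "disappears" below 1,
# and halves every 3 days (N = 3).
print(days_until_pain_subsides(initial_pain=10, threshold=1, half_life_days=3))
# ~10 days; someone with a higher threshold of 2 would report ~7 days instead.
```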

Nevertheless, the doctor’s logic (that pain never “disappears discretely”) had me impressed, and I’ve mentioned before on this blog how I get really impressed with doctors who are logically aware.

Oh, and I must mention that the same obstetrician who operated on my wife yesterday impressed me with her logical reasoning a week ago. My then unborn daughter wasn’t moving too well that day, because of which we were in hospital. My wife was given steroidal injections, and the baby started moving an hour later.

So when we mentioned to the obstetrician that “after you gave the steroids the baby started moving”, she curtly replied “the baby moving has nothing to do with the steroidal injections. The baby moves because the baby moves. It is just a coincidence that it happened after I gave the steroids”.

Bayes and serial correlation in disagreements

People who have been in a long-term relationship are likely to recognise that fights between a couple are not Markovian – in that the likelihood of fighting today is not independent of the likelihood of having fought yesterday.

In fact, if you had fought in a particular time period, it increases the likelihood that you’ll fight in the next time period. As a consequence, what you are likely to find is that there are times when you go days, weeks or even months at a stretch in a near-constant state of disagreement, while you’ll also have long periods of peace and bliss.

While this serial correlation can be disconcerting at times, and make you wonder whether you are in a relationship with the right person, it is not hard to understand why this happens. Once again, our old friend Reverend Thomas Bayes comes to the rescue here.

This is an extremely simplified model, but it will serve the purpose of this post. Each half of a couple believes that the other (better?) half can exist in one of two states – “nice” and “jerk”. In fact, it’s unlikely anyone will completely exist in one of these states – they’re likely to exist in a superposition of these states.

So let’s say that the probability of your partner being a jerk is P(J), which makes the probability of him/her being “nice” P(N) = 1 - P(J). Now when he/she does or says something (let’s call this event E), you implicitly do a Bayesian updation of these probabilities.

For every word/action of your partner, you can estimate the probability of observing it in each of the two cases – your partner being a jerk, and your partner being nice. After every action E by the partner, you update your priors about them with this new information.

So the new probability of him being a jerk (given event E) will be given by
P(J|E) = \frac{P(J).P(E|J)}{P(J).P(E|J) + P(N).P(E|N)} (the standard Bayesian formula).

Now notice that the new probability of the partner being a jerk is dependent upon the prior probability. So when P(J) is already high, it is highly likely that whatever action the partner does will not move the needle significantly. And the longer P(J) stays high, the higher the probability that you’ll lapse into a fight again. Hence the prolonged periods of fighting, and serial correlation.

This equation also explains why attempts to resolve a fight quickly can backfire. When you are fighting, the normal reaction to resolve it is by committing actions that indicate that you are actually nice. The problem is that the equation above has both P(E|N) and P(E|J) in it.

So, in order to resolve a fight, you should not only commit actions that you would do when you are perceived nice, but also actions that you would NOT do if you are a jerk. In other words, the easiest way to pull P(J) down in the above equation is to commit E with high P(E|N) and low P(E|J).
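
A toy numerical illustration (the likelihoods are invented) shows why it is the likelihood ratio, and not mere niceness, that moves the needle:

```python
def update_p_jerk(p_jerk, p_e_given_jerk, p_e_given_nice):
    """One Bayesian update of P(partner is a jerk) after observing action E."""
    num = p_jerk * p_e_given_jerk
    return num / (num + (1 - p_jerk) * p_e_given_nice)

p = 0.9   # mid-fight: the prior probability of "jerk" is already high

# A cheap nice gesture that a jerk would also make: P(E|J) close to P(E|N)
print(update_p_jerk(p, p_e_given_jerk=0.6, p_e_given_nice=0.7))   # ~0.89, barely moves

# A costly gesture a jerk would almost never make: low P(E|J), high P(E|N)
print(update_p_jerk(p, p_e_given_jerk=0.05, p_e_given_nice=0.8))  # ~0.36, a big drop
```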

What complicates things is that if you use one such weapon too many times, the partner will begin to see through you, and revise her P(E|J) for this event upwards. So you need to keep coming up with new tricks to defuse fights.

In short, that serial correlation exists in relationship fights is a given, and there is little you can do to prevent it. So if you go through a long period of continuous disagreement with your partner, keep in mind that such things are par for the course, and don’t do something drastic like breaking up.

Horses, Zebras and Bayesian reasoning

David Henderson at Econlog quotes a doctor on a rather interesting and important point, regarding Bayesian priors. He writes:

 Later, when I went to see his partner, my regular doctor, to discuss something else, I mentioned that incident. He smiled and said that one of the most important lessons he learned from one of his teachers in medical school was:

When you hear hooves, think horses, not zebras.

This was after he (Henderson) had some symptoms that are correlated with a heart attack, panicked and called his doctor, got treated for gas trouble, and was absolutely fine after that.

Our problem is that when we have symptoms that are correlated with something bad, we immediately assume that it’s the bad thing that has happened, and panic. In the process, we neither consider alternative explanations nor do a Bayesian analysis.

Let me illustrate with a personal example. Back when I was a schoolboy, and I wouldn’t return home from school at the right time, my mother would panic. This was the time before cellphones, remember, and she would just assume that “the worst” had happened and that I was in trouble somewhere. Calls would go to my father’s office, and he would ask her to wait, though to my credit I was never so late that they had to take any further action.

Now, coming home late from school can happen due to a variety of reasons. Let us eliminate reasons such as wanting to play basketball for a while before returning – since such activities were “usual” and had been budgeted for. So let’s assume that there are two possible reasons I’m late – the first is that I had gotten into trouble – I had either been knocked down on my way home or gotten kidnapped. The second is that the BTS (Bangalore Transport Service, as it was then called) schedule had gone completely awry, thanks to which I had missed my usual set of buses, and was thus delayed. Note that me not turning up at home until a certain point of time was a symptom of both of these.

Having noticed such a symptom, my mother would automatically come to the “worst case” conclusion (that I had been knocked down or kidnapped), and panic. But I’m not sure that was the most rational reaction. What she should have done was to do a Bayesian analysis and use that to guide her panic.

Let A be the event that I’d been knocked over or kidnapped, and B be the event that the bus schedule had gone awry. Let L(t) be the event that I haven’t gotten home till time t, and that such an event has been “observed”. The question is that, with L(t) having been observed, what are the odds of A and B having happened? Bayes Theorem gives us an answer. The equation is rather simple:

P(A|L(t)) = \frac{P(A).P(L(t)|A)}{P(A).P(L(t)|A) + P(B).P(L(t)|B)}

P(B|L(t)) is just one minus the above quantity (we assume that there is nothing else that can cause L(t)).

So now let us give values. I’m too lazy to find the data now, but let’s say we find from the national crime data that the odds of a fifteen-year-old boy being in an accident or kidnapped on a given day is one in a million. And if that happens, then L(t) obviously gets observed. So we have

P(A) = \frac{1}{1000000}
P(L(t) | A) = 1

The BTS was notorious back in the day for its delayed and messed up schedules. So let us assume that P(B) is \frac{1}{100}. Now, P(L(t)|B) is tricky, and is the reason the (t) qualifier has been added to L. The larger t is, the smaller the value of P(L(t)|B). If there is a bus schedule breakdown, there is probably a 50% probability that I’m not home an hour after “usual”. But there is only a 10% probability that I’m not home two hours after “usual” because of a bus breakdown. So

P(L(1)|B) = 0.5
P(L(2)|B) = 0.1

Now let’s plug in and based on how delayed I was, find the odds that I was knocked down/kidnapped. If I were late by an hour,
P(A|L(1)) = \frac{\frac{1}{1000000} \cdot 1}{\frac{1}{1000000} \cdot 1 + \frac{1}{100} \cdot 0.5}
or P(A|L(1)) = 0.00019996. In other words, if I hadn’t got home an hour after the usual time, the odds that I had been knocked down or kidnapped were just one in five thousand!

What if I still hadn’t come home two hours after my normal time? Again we can plug into the formula, and here we find that P(A|L(2)) = 0.000999, or one in a thousand! So notice that the later I am, the higher the odds that I’m in trouble. Yet, the numbers (admittedly based on the handwaving assumptions above) are small enough for us to not worry!
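
The arithmetic is easy to verify – a couple of lines of Python reproduce both numbers (under the same handwaving assumptions):

```python
def p_trouble(p_a, p_l_given_a, p_b, p_l_given_b):
    """P(A | L(t)) via Bayes, assuming A and B are the only possible causes of L(t)."""
    num = p_a * p_l_given_a
    return num / (num + p_b * p_l_given_b)

print(p_trouble(1e-6, 1.0, 0.01, 0.5))   # one hour late  -> ~0.0002 (one in five thousand)
print(p_trouble(1e-6, 1.0, 0.01, 0.1))   # two hours late -> ~0.001  (one in a thousand)
```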

Bayesian reasoning has its implications elsewhere, too. There is the medical case, as Henderson’s blogpost illustrates. Then we can use this to determine whether a wrong act was due to stupidity or due to malice. And so forth.

But what Henderson’s doctor told him is truly an immortal line:

When you hear hooves, think horses, not zebras.