Bayes Theorem and Respect

Regular readers of this blog will know very well that I keep talking about how everything in life is Bayesian. I may not have said it in so many words, but I keep alluding to it.

For example, when I’m hiring, I find the process to be Bayesian – the CV and the cover letter set a prior (it’s really a distribution, not a point estimate). Then each round of interview (or assignment) gives additional data that UPDATES the prior distribution. The distribution moves around with each round (when there is sufficient mass below a certain cutoff there are no more rounds), until there is enough confidence that the candidate will do well.

In hiring, Bayes theorem can also work against the candidate. Like I remember interviewing this guy with an insanely spectacular CV, so most of the prior mass was to the “right” of the distribution. And then when he got a very basic question badly wrong, the update to the distribution was swift and I immediately cut him.
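The hiring process above can be sketched as conjugate Bayesian updating. This is my own minimal construction, not a model the post specifies: treat a candidate's competence as a Beta-distributed probability of passing any given round, with each round's result a Bernoulli observation.

```python
# A minimal sketch (my construction, not the post's model): the CV sets
# a Beta prior on the candidate's pass probability; each interview round
# is a Bernoulli observation that shifts the posterior.

def update(alpha, beta, passed):
    """Beta-Bernoulli conjugate update after one interview round."""
    return (alpha + 1, beta) if passed else (alpha, beta + 1)

def mean(alpha, beta):
    return alpha / (alpha + beta)

# A strong CV and cover letter set an optimistic but wide prior.
alpha, beta = 4.0, 2.0            # prior mean ~0.67 (numbers are mine)

for passed in [True, True, False, True]:   # four rounds of evidence
    alpha, beta = update(alpha, beta, passed)

print(round(mean(alpha, beta), 3))         # posterior mean after 4 rounds
```

A stopping rule like the one in the post ("when there is sufficient mass below a certain cutoff there are no more rounds") would check the Beta CDF below the cutoff rather than the mean.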

On another note, I’ve argued here about how stereotypes are useful – purely as a Bayesian prior when you have no more information about a person. So you use the limited data you have about them (age, gender, sex, sexuality, colour of skin, colour of hair, education and all that), and the best judgment you can make at that point is by USING this information rather than ignoring it. In other words, you need to stereotype.

However, the moment you get more information, you ought to very quickly update your prior (in other words, the ‘stereotype prior’ needs to be a very wide distribution, irrespective of where it is centred). Else it will be a bad judgment on your part.

In any case, coming to the point of this post, I find that the respect I have for people is also heavily Bayesian (I might have alluded to this while talking about interviewing). Typically, in the case of most people, I start with a very high degree of respect. It is actually a fairly narrowly distributed Bayesian prior.

And then as I get more and more information about them, I update this prior. The high starting position means that if they do something spectacular, it moves up only by a little. If they do something spectacularly bad, though, the distribution moves way left.

So I’ve noticed that when there is a fall, the fall is swift. This is again because of the way the maths works – you might have a very small probability of someone being “bad” (left tail). And then when they do something spectacularly bad (well into that tail), there is no option but to update the distribution such that a lot of the mass is now in this tail.
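The swiftness of the fall drops out of the odds form of Bayes' theorem. Here is a toy illustration (all the numbers are mine): a small prior left-tail mass, combined with an act that is far more likely under "bad" than under "good", moves most of the posterior mass into the tail in one step.

```python
# Toy numbers (mine, for illustration): P(bad) starts as a small left
# tail, but a spectacularly bad act is much likelier under "bad" than
# under "good", so one observation dominates the prior.

p_bad = 0.02                   # small prior left-tail mass
p_act_given_bad = 0.60         # a "bad" person quite plausibly does this
p_act_given_good = 0.01        # a "good" person almost never does

posterior_bad = (p_bad * p_act_given_bad) / (
    p_bad * p_act_given_bad + (1 - p_bad) * p_act_given_good
)
print(round(posterior_bad, 3))   # one act takes "bad" from 2% to over 50%
```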

Once that has happened, unless they do several spectacular things, it can become irredeemable. Each time they do something slightly bad, it confirms your prior that they are “bad” (on whatever dimension), and the distribution narrows there. And they become more and more irredeemable.

It’s like “you cannot unsee” the event that took their probability distribution and moved it way left. Soon, the end is near.

 

Lessons from poker party

In the past I’ve drawn lessons from contract bridge on this blog – notably, I’d described a strategy called “queen of hearts” in order to maximise chances of winning in a game that is terribly uncertain. Now it’s been years since I played bridge, or any card game for that matter. So when I got invited for a poker party over the weekend, I jumped at the invitation.

This was only the second time ever that I’d played poker in a room – I’ve mostly played online where there are no monetary stakes and you see people go all in on every hand with weak cards. And it was a large table, with at least 10 players being involved in each hand.

A couple of pertinent observations (reasonable return for the £10 I lost that night).

Firstly a windfall can make you complacent. I’m usually a conservative player, bidding aggressively only when I know that I have good chances of winning. I haven’t played enough to have mugged up all the probabilities – that probably offers an edge to my opponents. But I have a reasonable idea of what constitutes a good hand and bid accordingly.

My big drawdown happened in the hand immediately after I’d won big. After an hour or so of bleeding money, I’d suddenly more than broken even. That meant that in my next hand, I bid a bit more aggressively than I would have for what I had. For a while I managed to stay rational (after the flop I knew I had a 1/6 chance of winning big, and having mugged up the Kelly Criterion on my way to the party, bid accordingly).
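For reference, the Kelly criterion mentioned above sizes a bet as a fraction of bankroll: with win probability p at net odds of b to 1, the optimal fraction is f* = p − (1 − p)/b, and a negative value means don't bet. The odds below are my own illustrative numbers, not the actual hand's.

```python
# Kelly criterion sketch (odds are illustrative, not the actual hand):
# f* = p - (1 - p) / b for win probability p at b:1 net odds.

def kelly_fraction(p, b):
    """Optimal bankroll fraction for a p-probability bet at b:1 net odds."""
    return p - (1 - p) / b

p = 1 / 6                                  # the post-flop estimate in the post
print(round(kelly_fraction(p, b=8), 4))    # big pot: ~0.0625, bet ~6% of stack
print(round(kelly_fraction(p, b=4), 4))    # worse odds: negative, so fold
```

Note that with a 1/6 chance of winning, Kelly says to bet nothing at all unless the pot pays better than 5 to 1.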

And when the turn wasn’t to my liking I should’ve just gotten out – the (approx) percentages didn’t make sense any more. But I simply kept at it, falling for the sunk cost fallacy (what I’d put in thus far in the hand). I lost some 30 chips in that one hand, of which at least 21 came at the turn and the river. Without the high of having won the previous hand, I would’ve played more rationally and lost only 9. After all the lectures I’ve given on logic, correlation-causation and the sunk cost fallacy, I’m sad I lost so badly because of the last one.

The second big insight is that poverty leads to suboptimal decisions. Now, this is a well-studied topic in economics but I got to experience it first hand during the session. This was later on in the night, as I was bleeding money (and was down to about 20 chips).

I got pocket aces (a pair of aces in hand) – something I should’ve bid aggressively with. But with the first 3 open cards falling far away from the face cards and being uncorrelated, I wasn’t sure of the total strength of my hand (mugging up probabilities would’ve helped for sure!). So when I had to put in 10 chips to stay in the hand, I baulked, and folded.

Given the play on the table thus far, it was definitely a risk worth taking, and with more in the bank, I would have. But poverty and the Kelly Criterion meant that the number of chips that I was able to invest in the arguably strong hand was limited, and that limited my opportunity to profit from the game.
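The trade-off in that fold can be made concrete with pot odds (the post doesn't give the pot size, so the 50 chips below is an assumption): a call of c chips into a pot of P chips breaks even when your win probability exceeds c / (P + c).

```python
# Pot-odds sketch (pot size is assumed; the post only says the call
# was 10 chips): calling c into a pot of P breaks even at p = c/(P+c).

def breakeven_probability(pot, call):
    """Minimum win probability at which calling breaks even."""
    return call / (pot + call)

# Calling 10 chips into, say, a 50-chip pot needs only a 1-in-6 win rate,
# which pocket aces comfortably clear against most boards.
print(round(breakeven_probability(pot=50, call=10), 3))
```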

It is no surprise that the rest of the night petered out for me as my funds dwindled and my ability to play diminished. Maybe I should’ve bought in more when I was down to 20 chips – but then given my ability relative to the rest of the table, that would’ve been good money after bad.

Bayesian Recognition and the Inverse Charlie Chaplin Principle

So I bumped into Deepa at a coffee shop this evening. And she almost refused to recognise me. It turned out to be a case of Bayesian Recognition having gone wrong. And then followed in quick succession by a case of Inverse Charlie Chaplin Principle.

So I was sitting at this coffee shop in Jayanagar meeting an old acquaintance, and Deepa walked in, along with a couple of other people. It took me a while to recognise her, but presently I did, and it turned out that by then she was seated at a table such that we were directly facing each other, with some thirty feet between us (by now I was positive it was her).

I looked at her for a bit, waiting for her to recognise me. She didn’t. I had doubts about whether it was her, and almost took out my phone to message and ask her if it was indeed her. But then I decided it was a silly thing to do, and I should go for it the natural way. So I looked at her again, and looked at her for so long that if she were a stranger she would have thought I was leching at her (so you know that I was quite confident now that it was indeed Deepa). No response.

I started waving, with both arms. She was now looking at me, but past me. I continued waving, and I don’t know what my old acquaintance who I was talking to was thinking by now. And finally a wave back. And we both got up, and walked towards each other, and started talking.

The Charlie Chaplin principle comes from this scene in a Charlie Chaplin movie which I can’t remember right now where he is standing in front of a statue of the king. Everyone who goes past him salutes him, and he feels high that everyone is saluting him, while everyone in effect is saluting the statue of the king behind him.

Thus, the “Charlie Chaplin Principle” refers to the case where you think someone is smiling at you or waving at you or saluting you, and it turns out that they are doing that to someone who is collinear with you and them. Thus, you are like Charlie Chaplin, stupidly feeling happy about this person smiling/waving/saluting at you while it is someone else that they are addressing.

Like all good principles, this one too has an inverse – which we shall call the “Inverse Charlie Chaplin Principle”. In this one, someone is smiling or waving or blowing kisses at you, and you assume that the gesture is intended for someone else who is collinear with the two of you. Thus, you take no notice of the smile or wave or blown kiss, and get on with life, likely pissing off the person who is smiling or waving or blowing kisses at you!

Both these effects have happened to me a few times, and I’ve been on both sides of both effects. And an instance of the Inverse principle happened today.

Deepa claimed that she initially failed to recognise me because she assumed that I’m in Spain, and that thus there’s no chance I would be in Jayanagar this evening (clearly she reads this blog, but not so regularly!). Thus, she eliminated me from her search space and was unable to fit my face to anyone else she knows.

Then when I started waving, the Inverse Charlie Chaplin Principle took over. Bizarrely (there was no one between us in the cafe save the acquaintance I was talking to, and I wouldn’t be waving wildly at someone at the same table for two as me; and Deepa was sitting with her back to the wall of the cafe so I presumably could not have been waving at anyone else behind her), she assumed that I was waving to someone else (or so that was her claim), and that it took time for her to realise that it was her that I was actually waving to!

Considering how Bayesian Recognition can throw you off, I’m prepared to forgive her. But I didn’t imagine that Bayesian Recognition would throw her off so much that it would cause an Inverse Charlie Chaplin Effect on her!

Oh, I must mention that I have grown a stubble (the razor I took on my trip to Europe was no good), and that she mentioned she wasn’t wearing her glasses today. Whatever!

Review: The Theory That Would Not Die

I was introduced to Bayes’ Theorem of Conditional Probabilities in a rather innocuous manner back when I was in Standard 12. KVP Raghavan, our math teacher, talked about pulling black and white balls out of three different boxes. “If you select a box at random, draw two balls and find that both are black, what is the probability you selected box one?” , he asked and explained to us the concept of Bayes’ Theorem. It was intuitive, and I accepted it as truth.
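That boxes-and-balls puzzle is a nice one to work through. The post doesn't give the box contents, so the compositions below are made up for illustration: pick a box uniformly at random, draw two balls without replacement, observe both black, and ask for the posterior probability that it was box one.

```python
# A worked version of the classroom puzzle, with made-up box contents:
# (black, white) counts per box; pick a box uniformly, draw two balls
# without replacement, observe both black.

from math import comb

boxes = {1: (4, 1), 2: (2, 3), 3: (1, 4)}   # assumed compositions

def p_two_black(b, w):
    """Probability both draws are black, without replacement."""
    return comb(b, 2) / comb(b + w, 2)       # comb(1, 2) == 0, so box 3 drops out

prior = 1 / 3                                # boxes chosen uniformly
likelihoods = {k: p_two_black(b, w) for k, (b, w) in boxes.items()}
evidence = sum(prior * l for l in likelihoods.values())

posterior_box1 = prior * likelihoods[1] / evidence
print(round(posterior_box1, 3))              # Bayes: prior x likelihood / evidence
```

With these contents, seeing two black balls pushes the probability of box one from 1/3 up to 6/7: exactly the "intuitive" inversion the teacher was demonstrating.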

I wouldn’t come across the theorem again, however, for another four years or so, until, in a course on Communication, I came across a concept called “Hidden Markov Models”. If you were to observe a signal, and it could have come out of four different transmitters, what are the odds that it was generated by transmitter one? Once again, it was rather intuitive. And once again, I wouldn’t come across or use this theorem for a few years.
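The transmitter question is the same Bayes calculation with continuous observations. The setup below is my own toy version (the levels and noise are assumed): four transmitters send known signal levels, the channel adds Gaussian noise, and we compute the posterior over transmitters given one received value.

```python
# Toy version of the transmitter problem (levels and noise are assumed):
# each transmitter sends a known level, the channel adds Gaussian noise,
# and Bayes' theorem gives the posterior over transmitters.

from math import exp, pi, sqrt

levels = [-3.0, -1.0, 1.0, 3.0]   # transmitter output levels (assumed)
sigma = 1.0                        # channel noise std dev (assumed)

def gauss(x, mu, s):
    """Gaussian density at x with mean mu and std dev s."""
    return exp(-((x - mu) ** 2) / (2 * s * s)) / (s * sqrt(2 * pi))

received = 0.6
likelihoods = [gauss(received, mu, sigma) for mu in levels]
posteriors = [l / sum(likelihoods) for l in likelihoods]   # uniform prior
print([round(p, 3) for p in posteriors])   # mass concentrates on level 1.0
```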

A couple of years back, I started following the blog of Columbia Statistics and Social Sciences Professor Andrew Gelman. Here, I came across the terms “Bayesian” and “non-Bayesian”. For a long time, the terms baffled me to no end. I just couldn’t get what the big deal about Bayes’ Theorem was – as far as I was concerned it was intuitive and “truth” and I saw no reason to disbelieve it. However, Gelman frequently alluded to this topic, and started using the term “frequentists” for non-Bayesians. It was puzzling as to why people refused to accept such an intuitive rule.

The Theory That Would Not Die is Sharon Bertsch McGrayne’s attempt to tell the history of Bayes’ Theorem. The theorem, according to McGrayne,

survived five near-fatal blows: Bayes had shelved it; Price published it but was ignored; Laplace discovered his own version but later favored his frequency theory; frequentists virtually banned it; and the military kept it secret.

The book is about the development of the theorem and associated methods over the last two hundred and fifty years, ever since Rev. Thomas Bayes first came up with it. It talks about the controversies associated with the theorem, about people who supported, revived or opposed it; about key applications of the theorem, and about how it was frequently and for long periods virtually ostracized.

While the book is ostensibly about Bayes’s Theorem, it is also a story of how science develops, and comes to be. Bayes proposed his theorem but didn’t publish it. His friend Price put things together and published it, but without any impact. Laplace independently discovered it, but later in his life moved away from it, using frequency-based methods instead. The French army revived it and used it to determine the optimal way to fire artillery shells. But then academic statisticians shunned it and “Bayes” became a swearword in academic circles. Once again, it saw a revival at the Second World War, helping break codes and test weapons, but all this work was classified. And then it found supporters in unlikely places – biology departments, Harvard Business School and military labs – but statistics departments continued to oppose it.

The above story is pretty representative of how a theory develops – initially it finds few takers. Then popularity grows, but the establishment doesn’t like it. It then finds support from unusual places. Soon, this support comes from enough places to build momentum. The establishment continues to oppose but is then bypassed. Soon everyone accepts it, but some doubters remain.

Coming back to Bayes’ Theorem – why is it controversial and why was it ostracized for long periods of time? Fundamentally it has to do with the definition of probability. According to “frequentists”, who should more correctly be called “objectivists”, probability is objective, and based on counting. Objectivists believe that probability is based on observation and data alone, and not from subjective beliefs. If you ask an objectivist, for example, the probability of rain in Bangalore tomorrow, he will be unable to give you an answer – “rain in Bangalore tomorrow” is not a repeatable event, and cannot be observed multiple times in order to build a model.

Bayesians, who should more correctly be called “subjectivists”, on the other hand believe that probability can also come from subjective beliefs. So it is possible to infer the probability of rain in Bangalore tomorrow based on other factors – like the cloud cover in Bangalore today or today’s maximum temperature. According to subjectivists (which is the current prevailing thought), probability for one-time events is also defined, and can be inferred from other subjective factors.

Essentially, the battle between Bayesians and frequentists is more to do with the definition of probability than with whether it makes sense to define inverse probabilities as in Bayes’ Theorem. The theorem is controversial only because the prevailing statistical establishment did not agree with the “subjectivist” definition of probability.

There are some books that I call ‘blog-books’. These usually contain ideas that could be easily explained in a blog post, but are expanded to book length – possibly because it is easier to monetize a book-length manuscript than a blog-length one. When I first downloaded a sample of this book to my Kindle I was apprehensive that this book might also fall under that category – after all, how much can you talk about a theorem without getting too technical? However, McGrayne avoids falling into that trap. She peppers the book with interesting stories of the application of Bayes’ Theorem through the years, and also short biographical tidbits of some of the people who helped shape the theorem. Sometimes (especially towards the end) some of these examples (of applications) seem a bit laboured, but overall, the book sustains adequate interest from the reader through its length.

If I had one quibble with the book, it would be that even after describing the story of the theorem, the book talks about “Bayesian” and “non-Bayesian” camps, and talks about certain scientists “not doing enough to further the Bayesian cause”. For someone who is primarily interested in getting information out of data, and doesn’t care about the methods involved, it was a bit grating that scientists be graded on their “contribution to the Bayesian cause” rather than their “contribution to science”. Given the polarizing history of the theorem, however, it is perhaps not that surprising.

The Theory That Would Not Die: How Bayes’ Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy
by Sharon Bertsch McGrayne
USD 12.27 (Kindle edition)
360 pages (including appendices and notes)