Bad Data Analysis

This is a post tangentially related to work, so I must point out that all views here are my own, and not views of my employer or anyone else I’m associated with.

The good thing about data analysis is that it’s inherently easy to do. The bad thing about data analysis is also that it’s inherently easy to do – with increasing data democratisation in companies, it is easier than ever to pull some data related to your hypothesis, build a few pivot tables and charts in Excel and then present your results.

Why is this a bad thing, you may ask – the reason is that it is rather easy to do bad data analysis. I’m never tired of telling people who ask me “what does the data say?”, “what do you want it to say? I can make it say that”. This is not a rhetorical statement. As the old saying goes, you can “take data down into the basement and torture it until it confesses to your hypothesis”.

So, for example, when I hire analysts, I don’t check as much for the ability to pull and analyse data (those can be taught) as I do for their logical thinking skills. When they do a piece of data analysis, are they able to say whether it makes sense or not? Can they identify that some correlations the data shows are spurious? Are they taking ratios along the correct axis (e.g. “2% of Indians are below the poverty line”, versus “20% of the world’s poor are in India”)? Are they controlling for confounding variables?
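To make the “ratios along the correct axis” point concrete, here is a minimal sketch. The counts are made up purely so that the two percentages quoted above fall out; they are not real statistics, and the variable names are my own.

```python
# Hypothetical counts (in millions), chosen only to reproduce the two
# illustrative ratios quoted in the text. These are NOT real statistics.
india_population = 1400
india_poor = 28
world_poor = 140

def share(part, whole):
    """Return part as a fraction of whole."""
    return part / whole

# Same numerator, different denominators: two very different statements.
poverty_rate_in_india = share(india_poor, india_population)
india_share_of_world_poor = share(india_poor, world_poor)

print(f"{poverty_rate_in_india:.0%} of Indians are below the poverty line")
print(f"{india_share_of_world_poor:.0%} of the world's poor are in India")
```

Both numbers use the same count of poor people in India. Which one is “right” depends entirely on whether the question is about India or about the world’s poor.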

This is the real skill in analytics – are you able to draw logical and sensible conclusions from what the data says? It is no coincidence that half my team at my current job has been formally trained in economics.

One of the externalities of being a head of analytics is that you come across a lot of bad data analysis – you are yourself responsible for some of it, your team is responsible for some more and given the ease of analysing data, there is a lot from everyone else as well.

And it becomes part of your job to comment on this analysis, to draw sense from it, and to say if it makes sense or not. In most cases, the mechanics of the analysis will be immaculate – well-written queries and code. The problem, almost all the time, is in the reasoning used.

I was reading this post by Nabeel Qureshi on puzzles. There, he quotes a book on chess puzzles, and talks about the differences between how experts approach a problem compared to novices.

The lesson I found the most striking is this: there’s a direct correlation between how skilled you are as a chess player, and how much time you spend falsifying your ideas. The authors find that grandmasters spend longer falsifying their idea for a move than they do coming up with the move in the first place, whereas amateur players tend to identify a solution and then play it shortly after without trying their hardest to falsify it first. (Often amateurs, find reasons for playing the move — ‘hope chess’.)

Call this the ‘falsification ratio’: the ratio of time you spend trying to falsify your idea to the time you took coming up with it in the first place. For grandmasters, this is 4:1 — they’ll spend 1 minute finding the right move, and another 4 minutes trying to falsify it, whereas for amateurs this is something like 0.5:1 — 1 minute finding the move, 30 seconds making a cursory effort to falsify it.

It is the same in data analysis. If I think about the amount of time I spend analysing data, a very large percentage of it (can’t put a number since I don’t track my time) goes into “falsifying it”. “Does this correlation make sense?”; “Have I taken care of all the confounding variables?”; “Does the result hold if I take a different sample or cut of data?”; “Has the data I’m using been collected properly?”; “Are there any biases in the data that might be affecting the result?”; and so on.

It is not an easy job. One small adjustment here or there, and the entire recommendation might flip. Despite being rigorous with the whole process, you can leave in some inaccuracy. And sometimes what your data shows may not conform to the biases of the counterparty (who has much better domain knowledge) – and so you have a much harder job selling it.

And once again – when someone says “we have used data, so we have been rigorous about the process”, it is all the more likely that they are wrong.

Risk and data

A while back a group of <a large number of scientists> wrote an open letter to the Prime Minister demanding greater data sharing with them. I must say that the letter is written in academic language and the effort to understand it was too much, but in the interest of fairness I’ll put a screenshot that was posted on Twitter here.

I don’t know about this clinical and academic data. However, the holding back of one kind of data, in my opinion, has massively (and negatively) impacted people’s mental health and risk calculations.

This is data on mortality and risk. The kinds of questions that I expected government data to answer were:

  1. If I get covid-19 (now in the second wave), what is the likelihood that I will die?
  2. If my oxygen level drops to 90 (>= 94 is “normal”), what is the likelihood that I will die?
  3. If I go to hospital, what is the likelihood I will die?
  4. If I go to ICU what is the likelihood I will die?
  5. What is the likelihood of a teenager who contracts the virus (and is otherwise in good health) dying of the virus?

And so on. Simple risk-based questions whose answers can help people calibrate their lives and take calculated enough risks to get on with it without putting themselves and their loved ones at risk.

Instead, what we find from official sources is nothing but aggregates. Total numbers of people infected, dead, recovered and so on. And it is impossible to infer answers to the “risk questions” based on that.

And who fills in the gaps? The media, of course.

I must have discussed “spectacularness bias” on this blog several times before. Basically the idea is that for something to be news, it needs to carry information. And an event carries information if it occurs despite having a low prior probability (or fails to occur despite having a high prior probability). As I put it in my lectures, “‘dog bites man’ is not news. ‘man bites dog’ is news”.
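One way to put a number on “carries information” is Shannon’s self-information, where an event with prior probability p carries -log2(p) bits. That formalisation is my choice for illustration here, and the two probabilities below are invented, not measured.

```python
import math

def surprisal_bits(probability):
    """Shannon self-information of an event, in bits: -log2(p)."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)

# Hypothetical prior probabilities, chosen only for illustration.
p_dog_bites_man = 0.5
p_man_bites_dog = 1 / 1024

print("dog bites man:", surprisal_bits(p_dog_bites_man), "bits")
print("man bites dog:", surprisal_bits(p_man_bites_dog), "bits")
```

The rarer the event, the more bits it carries, which is exactly why it makes the news, and exactly why news is a poor sample from which to estimate how common something is.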

So when we rely on media reports to fill in the gaps in our risk systems, we end up taking all the wrong kinds of lessons. We learn that one seventeen-year-old boy died of covid despite being otherwise healthy. In the absence of other information, we assume that teenagers are at grave risk from the disease.

Similarly, cases of children looking for ICU beds get forwarded far more than cases of old people looking for ICU beds. In the absence of risk information, we assume that the situation must be grave among children.

Old people dying from covid goes unreported (unless the person was famous in some way or the other), since the information content in that is low. Young people dying gets amplified.

Based on all the reports that we see in the papers and other media (including social media), we get an entirely warped sense of what the risk profile of the disease is. And panic. When we panic, our health gets worse.

Oh, and I haven’t even spoken about bad risk reporting in the media. I saw a report in the Times of India this morning (unable to find a link to it) that said that “young are facing higher mortality in this wave”. Basically the story said that people under 60 account for a far higher proportion of deaths in the second wave than in the first.

Now there are two problems with that story.

  1. A large proportion of over 60s in India are vaccinated, so mortality is likely to be lower in this cohort.
  2. What we need is the likelihood of a person under 60 dying upon contracting covid. NOT the proportion of deaths accounted for by under 60s. This is the classic “averaging along the wrong axis” that they unleash upon you in the first test of any statistics course.
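The two problems above can be shown with a toy calculation. Every number below is invented for illustration (there is no real case or death data here). The sketch is set up so that the under-60 fatality rate is identical in both waves, while over-60 deaths fall in the second wave, as one might expect after vaccination.

```python
# Entirely hypothetical case and death counts for two waves.
waves = {
    "first": {
        "under_60": {"cases": 1000, "deaths": 10},
        "over_60": {"cases": 200, "deaths": 20},
    },
    "second": {
        "under_60": {"cases": 2000, "deaths": 20},
        "over_60": {"cases": 200, "deaths": 10},
    },
}

def share_of_deaths(wave, group):
    """What the news story reported: the group's share of all deaths."""
    total_deaths = sum(g["deaths"] for g in wave.values())
    return wave[group]["deaths"] / total_deaths

def fatality_rate(wave, group):
    """What an individual needs: deaths per case within the group."""
    return wave[group]["deaths"] / wave[group]["cases"]

for name, wave in waves.items():
    print(
        f"{name} wave: under-60s are {share_of_deaths(wave, 'under_60'):.0%} of deaths, "
        f"but their fatality rate is {fatality_rate(wave, 'under_60'):.1%}"
    )
```

In this made-up example the under-60 share of deaths doubles between waves even though the risk to an individual under 60 who catches the disease has not moved at all.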

Anyway, so what kind of data would have helped?

  1. Age profile of people testing positive, preferably state wise (any finer will be noise)
  2. Age profile of people dying of covid-19, again state wise

I’m sure the government collects this data. Just that they’re not used to releasing this kind of data, so we’re not getting it. And so we have to rely on the media and its spectacularness bias to get our information. And so we panic.

PS: By no means am I stating that covid-19 is not a risk. All I am stating is that the information we have been given doesn’t help us make good risk decisions.

Election Counting Day

At the outset I must say that I’m deeply disappointed (based on the sources I’ve seen, mostly based on googling) with the reporting around the US presidential elections.

For example, if I google, I get something like “Biden leads Trump 225-213”. At the outset, that seems like useful information. However, the “massive discretisation” of the US electorate means that it actually isn’t. Let me explain.

Unlike India, where each of the 543 constituencies has a separate election, and the result of one doesn’t influence another, the US presidential election is decided at the state level. In all but a couple of small states, the party that gets the most votes in the state gets all the electoral votes of that state. So something like California is worth 55 votes. Florida is worth 29 votes. And so on.

And some of these states are “highly red/blue” states, which means that they are extremely likely to vote for one of the two parties. For example, a victory is guaranteed for the Democrats in California and New York, states they had won comprehensively in the 2016 election (their dominance is so massive in these states that once a friend who used to live in New York had told me that he “doesn’t know any Republican voters”).

Just stating Biden 225 – Trump 213 obscures all this information. For example, if Biden’s 225 excludes California, the election is as good as over since he is certain to win the state’s 55 seats.

Also – this is related to my rant last week about the reporting of the opinion polls in the US – the front page on Google for US election results shows the number of votes that each candidate has received so far (among votes that have been counted). Once again, this is highly misleading, since the number of votes DOESN’T MATTER – what matters is the number of electoral votes (“seats” in an Indian context) each candidate gets, and that gets decided at the state level.

Maybe I’ve been massively spoilt by Indian electoral reporting, pioneered by the likes of NDTV. Here, it’s common to show the results and leads along with margins. It is common to show what the swing is relative to the previous elections. And some publications even do “live forecasting” of the total number of seats won by each party using a variation of the votes to seats model that I’ve written about.

American reporting lacks all of this. Headline numbers are talked about. “Live reports” on sites such as Five Thirty Eight are flooded with reports of individual senate seats, which, to me sitting halfway round the world, are noise. All I care about is the likelihood of Trump getting re-elected.

Reports talk about “swing states” and how each party has performed in these, but neglect to mention which party won each of them the last time. So “Biden leading in Arizona” is of no importance to me unless I know how Arizona had voted in 2016, and what the extent of the swing is.

So what would I have liked? 225-213 is fine, but can the publications project it to the full 538 seats? There are several “models” they can use for this. The simplest one is to assume that states that haven’t declared leads yet have voted the same way as they did in 2016. One level of complexity can be using the votes to seats model, by estimating swings from the states that have declared leads, and then applying it to similar states that haven’t given out any information. And then you can get more complicated, but you realise it isn’t THAT complicated.
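The simplest model described above fits in a few lines. This is only a sketch on a five-state subset: the electoral vote counts and 2016 winners are the real ones for those states, but the “declared leads” are invented for illustration, and `project` is a name I made up.

```python
from collections import Counter

# state -> (electoral votes, party that won the state in 2016)
STATES = {
    "California": (55, "D"),
    "Texas": (38, "R"),
    "Florida": (29, "R"),
    "New York": (29, "D"),
    "Arizona": (11, "R"),
}

# Hypothetical leads declared so far on counting day (invented for this sketch).
declared_leads = {"Florida": "R", "Arizona": "D"}

def project(states, leads):
    """Use the declared lead where there is one; otherwise assume the 2016 result."""
    tally = Counter()
    for name, (votes, winner_2016) in states.items():
        tally[leads.get(name, winner_2016)] += votes
    return tally

# The raw headline number only counts states that have declared leads.
headline = Counter()
for name, party in declared_leads.items():
    headline[party] += STATES[name][0]

projection = project(STATES, declared_leads)
print("headline:  ", dict(headline))
print("projection:", dict(projection))
```

In this toy subset the headline has R ahead, while the projection has D ahead once California and New York are filled in from 2016. The next level of complexity would replace the “same as 2016” fallback with a swing estimated from the states that have already declared.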

All in all, I’m disappointed with the reporting. I wonder if the split of American media down political lines has something to do with this.

Trip To Indiranagar

The first time I recall going to Indiranagar was in 1992, when we purchased a used car from someone who used to live there. While walking from the nearest bus stop to the house of the previous owner of our car, we had taken a longish route, as my parents admired all the “beautiful houses” in the area.

Six years later I went to school in that part of town. The “beautiful houses” were still there, and I used to walk past them on my way to school from the bus stop every morning. While I found the culture of the place to be quite different from that of Jayanagar (where I lived), I found the part of town to be nice, and liked going there (though not necessarily for school).

And it was another 6-year gap after school before I resumed my visits to Indiranagar. This time round, it wasn’t as regular as going to school, and most of the time the agenda was eating. Indiranagar by the mid 2000s had a lot of wonderful restaurants serving a nice variety of cuisines. Some of these restaurants were also rather fancy, and so when I met up with college friends living in Bangalore from time to time, it was usually in one place or another in Indiranagar. I continued to find the place nice.

Marriage and child and change in profession have all meant that visits to Indiranagar have become less frequent, and most of them nowadays are work-related. I spend time in coffee shops there. I take the metro to go there. I occasionally walk around a bit from meeting to meeting, but don’t notice the surroundings. Some eateries there continue to be nice, though there are a lot more of them nowadays than before.

Something snapped today when we went there for lunch.

Lunch was at “Burma Burma” which the wife had rather hyped up over the years, and where it is reportedly incredibly hard to find a table. The drive there was smooth, the car was handed over to the valet, and off we went inside to our table. The service was excellent, but the food was so-so. I’ve never eaten Burmese food in my life so I don’t know if it is supposed to taste that way, but it tasted extremely Indian. Moreover the food was “low density” – I ate until my stomach was full but still didn’t feel like I’d gotten sufficient energy.

It was after the meal that I realised how much Indiranagar has changed, and not for the better. Immediately after we got out of the restaurant, I ran after the valet to tell him to leave my car where it was (on a side road) since I had “other business on the road”.

I wanted to check out the newly opened Blue Tokai Coffee Shop, also on 12th Main. The walk to get there was horrendous. It was only 200 metres from Burma Burma (made a bit longer by our walking for a bit in the wrong direction), but it was impossible to walk anywhere but in the middle of the road. Footpaths were fully occupied by trees, dug up drains and parked vehicles. And there was a continuous line of parked vehicles right next to the footpath.

It was as if the 12th Main (the same road on which I would walk to school) area has been redesigned such that you drive from shop to shop, giving your car to valets who will then proceed to park it in some side road.

Oh, and Blue Tokai is a non-starter. It’s a small space on the first floor with acoustics so bad that one loud group in the place can render the whole place unbearable. It didn’t help that they took forever to take our order, and we decided to decamp to the (tried and trusted, for me) Third Wave Coffee Roasters on CMH Road.

And that meant another walk, though we eschewed 12th Main this time, and then a short drive. Both of us noticed that the roads of Indiranagar seemed narrower than what we remembered – maybe the multitude of restaurants there means valets keep parking all through the inside roads, and double parked roads can be narrow indeed. And the area around CMH where Third Wave is located isn’t particularly nice either.

It seems to me that Indiranagar is not posh any more. In a way it was so posh at one point in time that everyone sought to set up shop there, and all the shops mean that the area has lost its character. The “beautiful houses” are being torn down one by one, replaced by commercial buildings full of restaurants, whose valets will flood more and more of the inner roads with parked cars, and make the entire area unwalkable.

I’m pretty sure most of the posh people in the area have left, having sold their houses into the real estate boom. I just wonder where they have moved to!

PS: The coffee at Third Wave was incredibly bad as well. It’s not usually so – I keep saying that they’re the best coffee shop in Bangalore. The milk today was scalding hot, and the barista poured so much of it in our cups, and without any of the finesse you associate with a flat white, that it was completely tasteless.

Elite institutions and mental illness

At the Aditya Birla scholarship function last night, I met an old professor, who happened to remember me. We were exchanging emails today and he happened to ask me about one of my classmates, who passed away last year. In reply to him I went off on a long rant about the incidence of mental illness in institutions such as IIT. Some of what I wrote, I thought, deserved a wider audience, so I’m posting an edited version here. I’ve edited out people’s names to protect their privacy. 

****
<name blanked out 1> passed away a year and a half back. He had been diagnosed with bipolar disorder and was also suffering from depression, and he committed suicide. He had been living in Bangalore in his last days, working with an IT company here. I had invited him to my wedding a couple of months before that, but had got no response from him.
He was the third person that our 30-odd strong Computer Science class from IIT Madras lost. Prior to that another of our classmates had killed himself, and he too was known to be suffering from some form of mental illness. The other was a victim of a motorcycle accident.
I’m quite concerned about the incidence of mental illness among elite students. From my IIT Madras Computer Science class alone, I know at least six people who at some point of time or the other have been diagnosed with mental illness. I myself have been under treatment for depression and ADHD since January this year. And I don’t think our class is a particularly skewed sample.
I think this is a matter of great concern, and doesn’t get the attention that it deserves. I don’t know if there are some systemic issues that are causing this, but losing graduates of elite institutions to mental illness is I think a gross wastage of resources! I don’t know what really needs to be done, but I think one thing that is certainly going to help is to set up on-campus psychiatric and counselling services (manned by trained professionals; I know IIT Madras has a notional “guidance and counselling unit” but I’m not sure what kind of counselling they’re really capable of), and to encourage students to seek help when they sense a problem.
Of course, there are other constraints at play here – firstly there is a shortage of trained psychiatrists in India. I remember reading a report somewhere that compared to international standards, India has only one third of the number of psychiatrists it requires. More importantly, there is the social stigma related to mental illness which prevents people from seeking professional help (I must admit that I faced considerable opposition from my own family when I wanted to consult a psychiatrist), and sometimes by the time people do get help, irreparable damage might have been done in terms of career. Having read up significantly on mental illnesses for a few months now, and looking back at my own life, I think I had been depressed ever since 2000, when I joined IIT. And it took over 11 years before I sought help and got it diagnosed. Nowadays, I try to talk about my mental illness in public forums, to try and persuade people that there is nothing shameful in being mentally ill, and to encourage them to seek help as soon as possible when they think they have a problem.
I’m really sorry I’ve gone off on a tangent here but this is a topic that I feel very strongly about, and got reminded about when I started thinking about <name blanked out 1>….
****
I know that several universities abroad offer free psychiatric support to students, and I know a number of friends who have taken advantage of such programs and gotten themselves diagnosed, and are leading significantly better lives now. I don’t really know how to put it concisely but if you think you suffer from some mental illness, I do encourage you to put aside your own and your family’s stigmas, and go ahead and seek help.

Cribbing

When this blog was young, I used to crib a lot. Ok let me correct that. When my livejournal, which is the predecessor of this blog, was young, I used to crib a lot. At least half my posts were “crib posts”. They would go along the lines of “oh i’m feeling so crappy. everything’s awful with the world”. I’d occasionally get comments. They’d either be of the “yeah you’ve done wrong” variety or “ok i empathise” type. Most such comments didn’t get any posts at all. Sorry, I meant that most such posts didn’t get any comments at all.

I decided to obey the market and moved away from crib posts. While I still do crib once in a while, and use this blog for personal rants, I don’t crib here as much as I used to. This “adjustment to the needs of the market” has had its own problem. It sometimes makes you go to the other extreme, where you are just not able to crib at all. You feel guilty about cribbing. Every time you crib, you think you are bothering someone and so you should stop. You stop.

Cribbing is an art. Not everyone is a master of cribbing. The biggest problem with the lack of ability to crib is what I call the “two-person theory”. It is something like each of us is a superposition of several people. And at each point in time, we “collapse” to one of those people. All of us live in the form of a dynamic equilibrium as me. We share a hard disk, but we don’t share memory. And usually, there is no coexistence. Ok I suppose the name two-person theory is some kind of a misnomer. It is actually the trivial case where you are made of a superposition of just two people.

When you are low, you want to crib. You want to pour out all your woes to the world. You want to cry. You want to be cared for. You want to be cuddled. But you can’t speak. You don’t have the confidence to speak. You just don’t feel like speaking. Speaking is an effort. And you end up not cribbing, though you really wanted to crib.

You recover. And you are now not who you were when you were low. You are in a different state (no, this is not like going from Delhi to Haryana). You are able to communicate now. You remember that you needed to crib. That much has been coded into the hard disk. However, the details of that were left in the memory of the other you. You don’t know what to crib about. You don’t understand the other yourself. You trivialize your body-sharer. You decide you are better off not cribbing. And you are happy that you didn’t feel. Each time you think about it, you feel happier that you didn’t crib; until you want to crib again, and are too low to communicate.

Now you blame the other you for not speaking out for you. Effectively you blame yourself. You feed into the downward spiral. You want to crib even more now. And you can’t communicate. You wait till you get better. And then you wait till you get worse. You cycle. You oscillate.

One of you is dumb and can’t communicate. The other of you can’t understand the other of you. This other doesn’t want to speak for the other other. One of the others is unable to communicate. And like the mythical Bherunda bird (also the state emblem of Karnataka), one of the others consumes poison, taking the other other down with him.

Lazy Post – Statistical Analysis

I call this a lazy post since I didn’t originally write it as a blog post. I had written this as an email to a mailing list, and now thought it might make sense as a blog post. The reference to context: a prominent and well-respected member of the group had written a fairly lengthy argument, and ended it by saying “Maybe this calls for a good regression analysis….”. My reply is here.

I need to mention here that this mail to the group wasn’t responded to (apart from one tangential remark by Udupa). I don’t know if it simply got lost in the flood of mails on the list today, or if people on the group (in general, a very intelligent lot) don’t care for this kind of stuff, or if, for some reason, this caused discomfort of some sort. Anyway, I begin:

I think I had raised this point before in a similar context. it is about the use and misuse of statistical analysis. i think one lesson that ought to be learnt from the ongoing financial crisis and the events leading up to this is that statistical analysis, when misused, can have dangerous consequences, and this is not just for the people who are misusing the analyses.

there is this popular view that if there is data, then one ought to do statistical analysis, and draw conclusions from that, and make decisions based on these conclusions. unfortunately, in a large number of cases, the analysis ends up being done by someone who is not very proficient with statistics and who is basically applying formulae rather than using a concept. as long as you are using statistics as concepts, and not as formulae, I think you are fine. but once you get into the “ok i see a time series here. let me put regression. never mind the significance levels or stationarity or any other such blah blah but i’ll take decisions based on my regression” mode, you are likely to get into trouble.
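A small sketch of why ignoring stationarity gets you into trouble. The two series below are constructed by hand (they are not from the discussion on the list, and every name is made up): each is a shared upward trend plus its own unrelated wiggle. Correlation is used as a stand-in for a simple regression, since for one explanatory variable R-squared is just the squared correlation.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def first_differences(series):
    """Period-to-period changes, a common way to remove a trend."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

# Two hand-built, trending (non-stationary) series with unrelated wiggles.
steps = range(41)
wiggle_a = [1 if t % 2 == 0 else -1 for t in steps]  # +1, -1, +1, -1, ...
wiggle_b = [1 if t % 4 < 2 else -1 for t in steps]   # +1, +1, -1, -1, ...
series_x = [t + a for t, a in zip(steps, wiggle_a)]
series_y = [t + b for t, b in zip(steps, wiggle_b)]

corr_levels = pearson(series_x, series_y)
corr_changes = pearson(first_differences(series_x), first_differences(series_y))

print("correlation of levels: ", round(corr_levels, 3))
print("correlation of changes:", round(corr_changes, 3))
```

The levels look almost perfectly related because both series trend upward. Once the trend is differenced away, the changes in one say nothing about the changes in the other. A regression on the raw levels would report a strong fit that means nothing.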

i think this is broadly the kind of point that is made by people like Paul Wilmott. that the problem is not with statistical analysis, but with the way people use statistical analysis.

ok, now that i’m done with my rant, I’m very sceptical about regression yielding any kind of conclusive results here. i think the number of data points we have here is too small to produce any meaningful results. of course i’m saying this without really looking at all the data that you might want to include. and i won’t be surprised if a few tens of papers get published on this topic. all based on statistical analyses. and the results all being orthogonal to one another.

The Problem with Amit Varma

Ok, at the outset I must admit that the title is misleading. There is no problem with Amit Varma. He is an excellent fellow, and great fun to hang out with. What I have been having a problem with is his blog, the ever-so-popular India Uncut. I’ve been reading it regularly for over two years, and now suddenly there seems to be something stale about it.

Amit’s method of writing is what I call “reinforcement writing”. Other proponents of this style include Ajay Shah and Percy Mistry, and several commies whose writing I don’t care to read. The thing with these people is that they have one basic idea. And 60% (ok i made up that number) of all their essays talk about this one idea. The remaining 40% (made up, once again) is used to present the same idea in a different context, or to paint it using a different colour. Maybe their hope is that when people read about the same idea several times, they will get convinced and buy the idea. The concept here is that every time people read about this idea, they would have forgotten that the last time they heard about this was from the same source, and their belief in this idea gets reinforced.

Amit’s chosen idea is one of liberty. Like classic libertarians define themselves – “free markets and free minds”. Go through all of Amit’s serious essays (basically discounting his essays on cows), and you will find this to be the unifying thread. Go back, and look at all the Thursday editions of Mint between Feb07 and Feb08. I haven’t kept count, but my sense is that at least 40 of those 50-odd columns had liberty as their underlying theme.

It may be the case that his mandate in that column was to write about liberty, but this concept has now become big in his normal blogging also, and maybe in all his thoughts. A few weeks ago, I wrote to a mailing list that Amit and I are both part of saying that I was planning to write an Economic Travelogue. And Amit’s quick suggestion was that I should write it from a liberty and freedom point of view.

The reason I’m writing this essay is that as a regular and loyal reader of Amit’s blog, I feel cheated. I feel cheated that he isn’t adding much value by way of his posts. I’m not cribbing about the volume here, since I know that he is busy trying to get a novel published and doesn’t have much time to blog. I’m not cribbing about the quality of writing here – as always, it is excellent. What I am cribbing about is the content. That – for a regular reader – the marginal value of each of his posts is infinitesimal. And the danger is that the marginal value of looking at his posts might soon turn negative, which will result in my unsubscribing from his blog.

I know that Amit is really good at writing about liberty. He even won a Bastiat for that. However, he needs to realize that most of his audience is “repeat audience”. His essays might make a big impact for a new reader, but they do little to please existing ones. This is like a business that spends so much of its energy in acquiring new customers that it ends up pissing off the existing customers. Such a business model is never going to be sustainable.