69 is the answer

The IDFC-Duke-Chicago survey that concluded that 50% of Bangalore had covid-19 in late June only surveyed 69 people in the city. 

When it comes to most things in life, the answer is 42. However, if you are trying to rationalise the IDFC-Duke-Chicago survey that found that over 50% of people in Bangalore had had covid-19 by end-June, then the answer is not 42. It is 69.

For that is the sample size that the survey used in Bangalore.

Initially I had missed this as well. However, this evening I attended half of a webinar where some of the authors of the survey spoke about the survey and the paper, and there they let the penny drop. And then I found – it’s in one small table in the paper.

The IDFC-Duke-Chicago survey only surveyed 69 people in Bangalore

The above is the table in its glorious full size. It takes effort to read the numbers. Look at the second last line. In Bangalore Urban, the ELISA results (for antibodies) were available for only 69 people.

And if you look at the appendix, you find that 52.5% of respondents in Bangalore had antibodies to covid-19 (that is 36 people). So in late June, they surveyed 69 people and found that 36 had antibodies for covid-19. That’s it.

To their credit, they didn’t highlight this result (I sort of dug through their paper to find these numbers and call the survey into question). And they mentioned in tonight’s webinar as well that their objective was to get an idea of the prevalence in the state, and not just in one particular region (even if it be as important as Bangalore).

That said, two things that they said during the webinar in defence of the paper that I thought I should point out here.

First, Anu Acharya of MapMyGenome (also a co-author of the survey) said “people have said that a lot of people we approached refused consent to be surveyed. That’s a standard of all surveying”. That’s absolutely correct. In any random survey, you will always have an implicit bias because the sort of people who will refuse to get surveyed will show a pattern.

However, in this particular case, the point to note is the extremely high number of people who refused to be surveyed – over half the households in the panel refused to be surveyed, and in a further quarter of the panel households, the identified person refused to be surveyed (despite the family giving clearance).

One of the things with covid-19 in India is that in the early days of the pandemic, anyone found having the disease would be force-hospitalised. I had said back then (not sure where) that hospitalising asymptomatic people was similar to the “precogs” in Minority Report – you confine the people because they MIGHT INFECT OTHERS.

For this reason, people didn’t want to get tested for covid-19. If you accidentally tested positive, you would be institutionalised for a week or two (and be made to pay for it, if you demanded a private hospital). Rather, unless you had clear symptoms or were ill, you were afraid of being tested for covid-19 (whether RT-PCR or antibodies, a “representative sample” won’t understand).

However, if you had already got covid-19 and “served your sentence”, you would be far less likely to be “afraid of being tested”. This, in conjunction with the rather high proportion of the panel that refused to get tested, suggests that there was a clear bias in the sample. And since the numbers for Bangalore clearly don’t make sense, it lends credence to the sampling bias.

And sample size apart, there is nothing Bangalore-specific about this bias (apart from that in some parts of the state, the survey happened after people had sort of lost their fear of testing). This further suggests that overall state numbers are also an overestimate (which fits in with my conclusion in the previous blogpost).

The other thing that was mentioned in the webinar that sort of cracked me up was the reason why the sample size was so low in Bangalore – a lockdown got announced while the survey was on, and the sampling team fled. In today’s webinar, the paper authors went off on a rant about how surveying should be classified as an “essential activity”.

In any case, none of this matters. All that matters is that 69 is the answer.

 

More on Covid-19 prevalence in Karnataka

As the old song went, “when the giver gives, he tears the roof and gives”.

Last week the Government of Karnataka released its report on the covid-19 serosurvey done in the state. You might recall that it had concluded that the number of cases had been undercounted by a factor of 40, but then some things were suspect in terms of the sampling and the weighting.

This week comes another sero-survey, this time a preprint of a paper that has been submitted to a peer reviewed journal. This survey was conducted by the IDFC Institute, a think tank, and involves academics from the University of Chicago and Duke University, and relies on the extensive sampling network of CMIE.

At the broad level, this survey confirms the results of the other survey – it concludes that “Overall seroprevalence in the state implies that by August at least 31.5 million residents had been infected by August”. This is much higher than the overall conclusions of the state-sponsored survey, which had concluded that “about 19 million residents had been infected by mid-September”.

I like seeing two independent assessments of the same quantity. While each may have its own sources of error, and may not independently offer much information, comparing them can offer some really valuable insights. So what do we have here?

The IDFC-Duke-Chicago survey took place between June and August, and concluded that 31.5 million residents of Karnataka (out of a total population of about 70 million) have been infected by covid-19. The state survey in September had suggested 19 million residents had been infected by September.

Clearly, since these surveys measure the number of people “who have ever been affected”, both of them cannot be correct. If 31 million people had been affected by end August, clearly many more than 19 million should have been infected by mid-September. And vice versa. So, as Ravi Shastri would put it, “something’s got to give”. What gives?

Remember that I had thought the state survey numbers might have been an overestimate thanks to inappropriate sampling (“low risk” not being low risk enough, and not weighting samples)? If 20 million by mid-September was an overestimate, what do you say about 31 million by end August? Surely an overestimate? And that is not all.

If you go through the IDFC-Duke-Chicago paper, there are a few figures and tables that don’t make sense at all. For starters, check out this graph, that for different regions in the state, shows the “median date of sampling” and the estimates on the proportion of the population that had antibodies for covid-19.

Check out the red line on the right. The sampling for the urban areas for the Bangalore region was completed by 24th June. And the survey found that more than 50% of respondents in this region had covid-19 antibodies. On 24th June.

Let’s put that in context. As of 24th June, Bangalore Urban had 1700 confirmed cases. The city’s population is north of 10 million. I understand that 24th June was the “median date” of the survey in Bangalore city. Even if the survey took two weeks after that, as of 8th of July, Bangalore Urban had 12500 confirmed cases.

The state survey had estimated that known cases were 1 in 40. 12500 confirmed cases suggests about 500,000 actual cases. That’s 5% of Bangalore’s population, not 50% as the survey claimed. Something is really really off. Even if we use the IDFC-Duke-Chicago paper’s estimates that only 1 in 100 cases were reported / known, then 12500 known cases by 8th July translates to 1.25 million actual cases, or 12.5% of the city’s population (well below 50% ).

My biggest discomfort with the IDFC-Duke-Chicago effort is that it attempts to sample a rather rapidly changing variable over a long period of time. The survey went on from June 15th to August 29th. By June 15th, Karnataka had 7200 known cases (and 87 deaths). By August 29th the state had 327,000 known cases and 5500 deaths. I really don’t understand how the academics who ran the study could reconcile their data from the third week of June to the data from the third week of August, when the nature of the pandemic in the state was very very different.

And now, having looked at this paper, I’m more confident of the state survey’s estimations. Yes, it might have sampling issues, but compared to the IDFC-Duke-Chicago paper, the numbers make so much more sense. So yeah, maybe the factor of underestimation of Covid-19 cases in Karnataka is 40.

Putting all this together, I don’t understand one thing. What these surveys have shown is that

  1. More than half of Bangalore has already been infected by covid-19
  2. The true infection fatality rate is somewhere around 0.05% (or lower).

So why do we still have a (partial) lockdown?

PS: The other day on WhatsApp I saw this video of an extremely congested Chickpet area on the last weekend before Diwali. My initial reaction was “these people have lost their minds. Why are they all in such a crowded place?”. Now, after thinking about the surveys, my reaction is “most of these people have most definitely already got covid and recovered. So it’s not THAT crazy”.

Covid-19 Prevalence in Karnataka

Finally, many months after other Indian states had conducted a similar exercise, Karnataka released the results of its first “covid-19 sero survey” earlier this week. The headline number being put out is that about 27% of the state has already suffered from the infection, and has antibodies to show for it. From the press release:

Out of 7.07 crore estimated populationin Karnataka, the study estimates that 1.93 crore (27.3%) of the people are either currently infected or already had the infection in the past, as of 16 September 2020.

To put that number in context, as of 16th September, there were a total of 485,000 confirmed cases in Karnataka (official statistics via covid19india.org), and 7536 people had died of the disease in the state.

It had long been estimated that official numbers of covid-19 cases are off by a factor of 10 or 20 – that the actual number of people who have got the disease is actually 10 to 20 times the official number. The serosurvey, assuming it has been done properly, suggests that the factor (as of September) is 40!

If the ratio has continued to hold (and the survey accurate), nearly one in two people in Karnataka have already got the disease! (as of today, there are 839,000 known cases in Karnataka)

Of course, there are regional variations, though I should mention that the smaller the region you take, the less accurate the survey will be (smaller sample size and all that). In Bangalore Urban, for example, the survey estimates that 30% of the population had been infected by mid-September. If the ratio holds, we see that nearly 60% of the population in the city has already got the disease.

The official statistics (separate from the survey) also suggest that the disease has peaked in Karnataka. In fact, it seems to have peaked right around the time the survey was being conducted, in September. In September, it was common to see 7000-1000 new cases confirmed in Karnataka each day. That number has come down to about 3000 per day now.

Now, there are a few questions we need to answer. Firstly – is this factor of 40 (actual cases to known cases) feasible? Based on this data point, it makes sense:

In May, when Karnataka had a very small number of “native cases” and was aggressively testing everyone who had returned to the state from elsewhere, a staggering 93% of currently active cases were asymptomatic. In other words, only 1 in 14 people who was affected was showing any sign of symptoms.

Then, as I might have remarked on Twitter a few times, compulsory quarantining or hospitalisation (which was in force until July IIRC) has been a strong disincentive to people from seeking medical help or getting tested. This has meant that people get themselves tested only when the symptoms are really clear, or when they need attention. The downside of this, of course, has been that many people have got themselves tested too late for help. One statistic I remember is that about 33% of people who died of covid-19 in hospitals died within 24 hours of hospitalisation.

So if only one in 14 show any symptoms, and only those with relatively serious symptoms (or with close relatives who have serious symptoms) get themselves tested, this undercount by a factor of 40 can make sense.

Then – does the survey makes sense? Is 15000 samples big enough for a state of 70 million? For starters, the population of the state doesn’t matter. Rudimentary statistics (I always go to this presentation by Rajeeva Karandikar of CMI)  tells us that the size of the population doesn’t matter. As long as the sample has been chosen randomly, all that matters for the accuracy of the survey is the size of the sample. And for a binary decision (infected / not), 15000 is good enough as long as the sample has been random.

And that is where the survey raises questions – the survey has used an equal number of low risk, high risk and medium risk samples. “High risk” have been defined as people with comorbidities. Moderate risk are people who interact a lot with a lot of people (shopkeepers, healthcare workers, etc.). Both seem fine. It’s the “low risk” that seems suspect, where they have included pregnant women and attendants of outpatient patients in hospitals.

I have a few concerns – are the “low risk” low risk enough? Doesn’t the fact that you have accompanied someone to hospital, or  gone to hospital yourself (because you are pregnant), make you higher than average risk? And then – there are an equal number of low risk, medium risk and high risk people in the sample and there doesn’t seem to be any re-weighting. This suggests to me that the medium and high risk people have been overrepresented in the sample.

Finally, the press release says:

We excluded those already diagnosed with SARS-CoV2 infection, unwilling to provide a sample for the test, or did not agree to provide informed consent

I wonder if this sort of exclusion doesn’t result in a bias in itself.

Putting all this together – that there are qual samples of low, medium and high risk, that the “low risk” sample itself contains people of higher than normal risk, and that people who have refused to participate in the survey have been excluded – I sense that the total prevalence of covid-19 in Karnataka is likely to be overstated. By what factor, it is impossible to say. Maybe our original guess that the incidence of the disease is about 20 times the number of known cases is still valid? We will never know.

Nevertheless, we can be confident that a large section of the state (may not be 50%, but maybe 40%?) has already been infected with covid-19 and unless the ongoing festive season plays havoc, the number of cases is likely to continue dipping.

However, this is no reason to be complacent. I think Nitin Pai is  bang on here.

And I know a lot of people who have been aggressively social distancing (not even meeting people who have domestic help coming home, etc.). It is important that when they do relax, they do so in a graded manner.

Wear masks. Avoid crowded closed places. If you are going to get covid-19 anyway (and many of us have already got it, whether we know it or not), it is significantly better for you that you get a small viral load of it.

Opinion polling in India and the US

(Relative) old-time readers of this blog might recall that in 2013-14 I wrote a column called “Election Metrics” for Mint, where I used data to analyse elections and everything else related to that. This being the election where Narendra Modi suddenly emerged as a spectacular winner, the hype was high. And I think a lot of people did read my writing during that time.

In any case, somewhere during that time, my editor called me “Nate Silver of India”.

I followed that up with an article on why “there can be no Nate Silver in India” (now they seem to have put it behind a sort of limited paywall). In that, I wrote about the polling systems in India and in the US, and about how India is so behind the US when it comes to opinion polling.

Basically, India has fewer opinion polls. Many more political parties. A far more diverse electorate. Less disclosure when it comes to opinion polls. A parliamentary system. And so on and so forth.

Now, seven years later, as we are close to a US presidential election, I’m not sure the American opinion polls are as great as I made them out to be. Sure, all the above still apply. And when these poll results are put in the hands of a skilled analyst like Nate Silver, it is possible to make high quality forecasts based on that.

However, the reporting of these polls in the mainstream media, based on my limited sampling, is possibly not of much higher quality than what we see in India.

Basically I don’t understand why analysts abroad make such a big deal of “vote share” when what really matters is the “seat share”.

Like in 2016, Hillary Clinton won more votes than Donald Trump, but Trump won the election because he got “more seats” (if you think about it, the US presidential elections is like a first past the post parliamentary election with MASSIVE constituencies (California giving you 55 seats, etc.) ).

And by looking at the news (and social media), it seems like a lot of Americans just didn’t seem to get it. People alleged that Trump “stole the election” (while all he did was optimise based on the rules of the game). They started questioning the rules. They seemingly forgot the rules themselves in the process.

I think this has to do with the way opinion polls are reported in the US. Check out this graphic, for example, versions of which have been floating around on mainstream and social media for a few months now.

This shows voting intention. It shows what proportion of people surveyed have said they will vote for one of the two candidates (this is across polls. The reason this graph looks so “continuous” is that there are so many polls in the US). However, this shows vote share, and that might have nothing to do with seat share.

The problem with a lot (or most) opinion polls in India is that they give seat share predictions without bothering to mention what the vote share prediction is. Most don’t talk about sample sizes. This makes it incredibly hard to trust these polls.

The US polls (and media reports of those) have the opposite problem – they try to forecast vote share without trying to forecast how many “seats” they will translate to. “Biden has an 8 percentage point lead over Trump” says nothing. What I’m looking for is something like “as things stand, Biden is likely to get 20 (+/- 15) more electoral college votes than Trump”. Because electoral college votes is what this election is about. The vote share (or “popular vote”, as they call it in the US (perhaps giving it a bit more legitimacy than it deserves) ), for the purpose of the ultimate result, doesn’t matter.

In the Indian context, I had written this piece on how to convert votes to seats (again paywalled, it seems like). There, I had put some pictures (based on state-wise data from general elections in India before 2014).

An image from my article for Mint in 2014 on converting votes to seats. Look at the bottom left graph

What I had found is that in a two-cornered contest, small differences in vote share could make a massive difference in the number of seats won. This is precisely the situation that they have in the US – a two cornered contest. And that means opinion polls predicting vote shares only should be taken with some salt.

Scrabble

I’ve forgotten which stage of lockdown or “unlock” e-commerce for “non-essential goods” reopened, but among the first things we ordered was a Scrabble board. It was an impulse decision. We were on Amazon ordering puzzles for the daughter, and she had just about started putting together “sounds” to make words, so we thought “scrabble tiles might be useful for her to make words with”.

The thing duly arrived two or three days later. The wife had never played Scrabble before, so on the day it arrived I taught her the rules of the game. We play with the Sowpods dictionary open, so we can check words that hte opponent challenges. Our “scrabble vocabulary” has surely improved since the time we started playing (“Qi” is a lifesaver, btw).

I had insisted on ordering the “official Scrabble board” sold by Mattel. The board is excellent. The tiles are excellent. The bag in which the tiles are stored is also excellent. The only problem is that there was no “scoreboard” that arrived in the set.

On the first day we played (when I taught the wife the rules, and she ended up beating me – I’m so horrible at the game), we used a piece of paper to maintain scores. The next day, we decided to score using an Excel sheet. Since then, we’ve continued to use Excel. The scoring format looks somewhat like this.

So each worksheet contains a single day’s play. Initially after we got the board, we played pretty much every day. Sometimes multiple times a day (you might notice that we played 4 games on 3rd June). So far, we’ve played 31 games. I’ve won 19, Priyanka has won 11 and one ended in a tie.

In any case, scoring on Excel has provided an additional advantage – analytics!! I have an R script that I run after every game, that parses the Excel sheet and does some basic analytics on how we play.

For example, on each turn, I make an average of 16.8 points, while Priyanka makes 14.6. Our score distribution makes for interesting viewing. Basically, she follows a “long tail strategy”. Most of the time, she is content with making simple words, but occasionally she produces a blockbuster.

I won’t put a graph here – it’s not clear enough. This table shows how many times we’ve each made more than a particular threshold (in a single turn). The figures are cumulative

Threshold
Karthik
Priyanka
30 50 44
40 12 17
50 5 10
60 3 5
70 2 2
80 0 1
90 0 1
100 0 1

Notice that while I’ve made many more 30+ scores than her, she’s made many more 40+ scores than me. Beyond that, she has crossed every threshold at least as many times as me.

Another piece of analysis is the “score multiple”. This is a measure of “how well we use our letters”. For example, if I start place the word “tiger” on a double word score (and no double or triple letter score), I get 12 points. The points total on the tiles is 6, giving me a multiple of 2.

Over the games I have found that I have a multiple of 1.75, while she has a multiple of 1.70. So I “utilise” the tiles that I have (and the ones on the board) a wee bit “better” than her, though she often accuses me of “over optimising”.

It’s been fun so far. There was a period of time when we were addicted to the game, and we still turn to it when one of us is in a “work rut”. And thanks to maintaining scores on Excel, the analytics after is also fun.

I’m pretty sure you’re spending the lockdown playing some board game as well. I strongly urge you to use Excel (or equivalent) to maintain scores. The analytics provides a very strong collateral benefit.

 

Covid-19 superspreaders in Karnataka

Through a combination of luck and competence, my home state of Karnataka has handled the Covid-19 crisis rather well. While the total number of cases detected in the state edged past 2000 recently, the number of locally transmitted cases detected each day has hovered in the 20-25 range.

Perhaps the low case volume means that Karnataka is able to give out data at a level that few others states in India are providing. For each case, the rationale behind why the patient was tested (which is usually the source where they caught the disease) is given. This data comes out in two daily updates through the @dhfwka twitter handle.

There was this research that came out recently that showed that the spread of covid-19 follows a classic power law, with a low value of “alpha”. Basically, most infected people don’t infect anyone else. But there are a handful of infected people who infect lots of others.

The Karnataka data, put out by @dhfwka  and meticulously collected and organised by the folks at covid19india.org (they frequently drive me mad by suddenly changing the API or moving data into a new file, but overall they’ve been doing stellar work), has sufficient information to see if this sort of power law holds.

For every patient who was tested thanks to being a contact of an already infected patient, the “notes” field of the data contains the latter patient’s ID. This way, we are able to build a sort of graph on who got the disease from whom (some people got the disease “from a containment zone”, or out of state, and they are all ignored in this analysis).

From this graph, we can approximate how many people each infected person transmitted the infection to. Here are the “top” people in Karnataka who transmitted the disease to most people.

Patient 653, a 34 year-old male from Karnataka, who got infected from patient 420, passed on the disease to 45 others. Patient 419 passed it on to 34 others. And so on.

Overall in Karnataka, based on the data from covid19india.org as of tonight, there have been 732 cases where a the source (person) of infection has been clearly identified. These 732 cases have been transmitted by 205 people. Just two of the 205 (less than 1%) are responsible for 79 people (11% of all cases where transmitter has been identified) getting infected.

The top 10 “spreaders” in Karnataka are responsible for infecting 260 people, or 36% of all cases where transmission is known. The top 20 spreaders in the state (10% of all spreaders) are responsible for 48% of all cases. The top 41 spreaders (20% of all spreaders) are responsible for 61% of all transmitted cases.

Now you might think this is not as steep as the “well-known” Pareto distribution (80-20 distribution), except that here we are only considering 20% of all “spreaders”. Our analysis ignores the 1000 odd people who were found to have the disease at least one week ago, and none of whose contacts have been found to have the disease.

I admit this graph is a little difficult to understand, but basically I’ve ordered people found for covid-19 in Karnataka by number of people they’ve passed on the infection to, and graphed how many people cumulatively they’ve infected. It is a very clear pareto curve.

The exact exponent of the power law depends on what you take as the denominator (number of people who could have infected others, having themselves been infected), but the shape of the curve is not in question.

Essentially the Karnataka validates some research that’s recently come out – most of the disease spread stems from a handful of super spreaders. A very large proportion of people who are infected don’t pass it on to any of their contacts.

Placing data labels in bar graphs

If you think you’re a data visualisation junkie, it’s likely that you’ve read Edward Tufte’s Visual Display Of Quantitative Information. If you are only a casual observer of the topic, you are likely to have come across these gifs that show you how to clean up a bar graph and a data table.

And if you are a real geek when it comes to visualisation, and you are the sort of person who likes long-form articles about the information technology industry, I’m sure you’ve come across Eugene Wei’s massive essay on “remove the legend to become one“.

The idea in the last one is that when you have something like a line graph, a legend telling you which line represents what can be distracting, especially if you have too many lines. You need to constantly move your head back and forth between the chart and the table as you try to interpret it. So, Wei says, in order to “become a legend” (by presenting information that is easily consumable), you need to remove the legend.

My equivalent of that for bar graphs is to put data labels directly on the bar, rather than having the reader keep looking at a scale (the above gif with bar graphs also does this). It makes for easier reading, and by definition, the bar graph conveys the information on the relative sizes of the different data points as well.

There is one problem, though, especially when you’re drawing what my daughter calls “sleeping bar graphs” (horizontal) – where do you really put the text so that it is easily visible? This becomes especially important if you’re using a package like R ggplot where you have control over where to place the text, what size to use and so on.

The basic question is – do you place the label inside or outside the bar? I was grappling with this question yesterday while making some client chart. When I placed the labels inside the bar, I found that some of the labels couldn’t be displayed in full when the bars were too short. And since these were bars that were being generated programmatically, I had no clue beforehand how long the bars would be.

So I decided to put all the labels outside. This presented a different problem – with the long bars. The graph would automatically get cut off a little after the longest bar ended, so if you placed the text outside, then the labels on the longest bar couldn’t be seen! Again the graphs have to come out programmatically so when you’re making them you don’t know what the length of the longest bar will be.

I finally settled on this middle ground – if the bar is at least half as long as the longest bar in the chart set, then you put the label inside the bar. If the bar is shorter than half the longest bar, then you put the label outside the bar. And then, the text inside the bar is right-justified (so it ends just inside the end of the bar), and the text outside the bar is left-justified (so it starts exactly where the bar ends). And ggplot gives you enough flexibility to decide the justification (‘hjust’) and colour of the text (I keep it white if it is inside the bar, black if outside), that the whole thing can be done programmatically, while producing nice and easy-to-understand bar graphs with nice labels.

Obviously I can’t share my client charts here, but here is one I made for the doubling days for covid-19 cases by district in India. I mean it’s not exactly what I said here, but comes close (the manual element here is a bit more).

 

 

More on covid testing

There has been a massive jump in the number of covid-19 positive cases in Karnataka over the last couple of days. Today, there were 44 new cases discovered, and yesterday there were 36. This is a big jump from the average of about 15 cases per day in the preceding 4-5 days.

The good news is that not all of this is new infection. A lot of cases that have come out today are clusters of people who have collectively tested positive. However, there is one bit from yesterday’s cases (again a bunch of clusters) that stands out.

Source: covid19india.org

I guess by now everyone knows what “travelled from Delhi” is a euphemism for. The reason they are interesting to me is that they are based on a “repeat test”. In other words, all these people had tested negative the first time they were tested, and then they were tested again yesterday and found positive.

Why did they need a repeat test? That’s because the sensitivity of the Covid-19 test is rather low. Out of every 100 infected people who take the test, only about 70 are found positive (on average) by the test. That also depends upon when the sample is taken.  From the abstract of this paper:

Over the four days of infection prior to the typical time of symptom onset (day 5) the probability of a false negative test in an infected individual falls from 100% on day one (95% CI 69-100%) to 61% on day four (95% CI 18-98%), though there is considerable uncertainty in these numbers. On the day of symptom onset, the median false negative rate was 39% (95% CI 16-77%). This decreased to 26% (95% CI 18-34%) on day 8 (3 days after symptom onset), then began to rise again, from 27% (95% CI 20-34%) on day 9 to 61% (95% CI 54-67%) on day 21.

About one in three (depending upon when you draw the sample) infected people who have the disease are found by the test to be uninfected. Maybe I should state it again. If you test a covid-19 positive person for covid-19, there is almost a one-third chance that she will be found negative.

The good news (at the face of it) is that the test has “high specificity” of about 97-98% (this is from conversations I’ve had with people in the know. I’m unable to find links to corroborate this), or a false positive rate of 2-3%. That seems rather accurate, except that when the “prior probability” of having the disease is low, even this specificity is not good enough.

Let’s assume that a million Indians are covid-19 positive (the official numbers as of today are a little more than one-hundredth of that number). With one and a third billion people, that represents 0.075% of the population.

Let’s say we were to start “random testing” (as a number of commentators are advocating), and were to pull a random person off the street to test for Covid-19. The “prior” (before testing) likelihood she has Covid-19 is 0.075% (assume we don’t know anything more about her to change this assumption).

If we were to take 20000 such people, 15 of them will have the disease. The other 19985 don’t. Let’s test all 20000 of them.

Of the 15 who have the disease, the test returns “positive” for 10.5 (70% accuracy, round up to 11). Of the 19985 who don’t have the disease, the test returns “positive” for 400 of them (let’s assume a specificity of 98% (or a false positive rate of 2%), placing more faith in the test)! In other words, if there were a million Covid-19 positive people in India, and a random Indian were to take the test and test positive, the likelihood she actually has the disease is 11/411 = 2.6%.

If there were 10 million covid-19 positive people in India (no harm in supposing), then the “base rate” would be .75%. So out of our sample of 20000, 150 would have the disease. Again testing all 20000, 105 of the 150 who have the disease would test positive. 397 of the 19850 who don’t have the disease will test positive. In other words, if there were ten million Covid-19 positive people in India, and a random Indian were to take the test and test positive, the likelihood she actually has the disease is 105/(397+105) = 21%.

If there were ten million Covid-19 positive people in India, only one-fifth of the people who tested positive in a random test would actually have the disease.

Take a sip of water (ok I’m reading The Ken’s Beyond The First Order too much nowadays, it seems).

This is all standard maths stuff, and any self-respecting book or course on probability and Bayes’s Theorem will have at least a reference to AIDS or cancer testing. The story goes that this was a big deal in the 1990s when some people suggested that the AIDS test be used widely. Then, once this problem of false positives and posterior probabilities was pointed out, the strategy of only testing “high risk cases” got accepted.

And with a “low incidence” disease like covid-19, effective testing means you test people with a high prior probability. In India, that has meant testing people who travelled abroad, people who have come in contact with other known infected, healthcare workers, people who attended the Tablighi Jamaat conference in Delhi, and so on.

The advantage with testing people who already have a reasonable chance of having the disease is that once the test returns positive, you can be pretty sure they actually have the disease. It is more effective and efficient. Testing people with a “high prior probability of disease” is not discriminatory, or a “sampling bias” as some commentators alleged. It is prudent statistical practice.

Again, as I found to my own detriment with my tweetstorm on this topic the other day, people are bound to see politics and ascribe political motives to everything nowadays. In that sense, a lot of the commentary is not surprising. It’s also not surprising that when “one wing” heavily retweeted my article, “the other wing” made efforts to find holes in my argument (which, again, is textbook math).

One possibly apolitical criticism of my tweetstorm was that “the purpose of random testing is not to find out who is positive. It is to find out what proportion of the population has the disease”. The cost of this (apart from the monetary cost of actually testing) are threefold. Firstly, a large number of uninfected people will get hospitalised in covid-specific hospitals, clogging hospital capacity and increasing the chances that they get infected while in hospital.

Secondly, getting a truly random sample in this case is tricky, and possibly unethical. When you have limited testing capacity, you would be inclined (possibly morally, even) to use it on people who already have a high prior probability.

Finally, when the incidence is small, we need a really large sample to find out the true range.

Let’s say 1 in 1000 Indians have the disease (or about 1.35 million people). Using the Chi Square test of proportions, our estimate of the incidence of the disease varies significantly on how many people are tested.

If we test a 1000 people and find 1 positive, the true incidence of the disease (95% confidence interval) could be anywhere from 0.01% to 0.65%.

If we test 10000 people and find 10 positive, the true incidence of the disease could be anywhere between 0.05% and 0.2%.

Only if we test 100000 people (a truly massive random sample) and find 100 positive, then the true incidence lies between 0.08% and 0.12%, an acceptable range.

I admit that we may not be testing enough. A simple rule of thumb is that anyone with more than a 5% prior probability of having the disease needs to be tested. How we determine this prior probability is again dependent on some rules of thumb.

I’ll close by saying that we should NOT be doing random testing. That would be unethical on multiple counts.

Simulating Covid-19 Scenarios

I must warn that this is a super long post. Also I wonder if I should put this on medium in order to get more footage.

Most models of disease spread use what is known as a “SIR” framework. This Numberphile video gives a good primer into this framework.

The problem with the framework is that it’s too simplistic. It depends primarily on one parameter “R0”, which is the average number of people that each infected patient infects. When R0 is high, each patient infects a number of other people, and the disease spreads fast. With a low R0, the disease spreads slow. It was the SIR model that was used to produce all those “flatten the curve” pictures that we were bombarded with a week or two back.

There is a second parameter as well – the recovery or removal rate. Some diseases are so lethal that they have a high removal rate (eg. Ebola), and this puts a natural limit on how much the disease can spread, since infected people die before they can infect too many people.

In any case, such modelling is great for academic studies, and post-facto analyses where R0 can be estimated. As we are currently in the middle of an epidemic, this kind of simplistic modelling can’t take us far. Nobody has a clue yet on what the R0 for covid-19 is. Nobody knows what proportion of total cases are asymptomatic. Nobody knows the mortality rate.

And things are changing well-at-a-faster-rate. Governments are imposing distancing of various forms. First offices were shut down. Then shops were shut down. Now everything is shut down, and many of us have been asked to step out “only to get necessities”. And in such dynamic and fast-changing environments, a simplistic model such as the SIR can only take us so far, and uncertainty in estimating R0 means it can be pretty much useless as well.

In this context, I thought I’ll simulate a few real-life situations, and try to model the spread of the disease in these situations. This can give us an insight into what kind of services are more dangerous than others, and how we could potentially “get back to life” after going through an initial period of lockdown.

The basic assumption I’ve made is that the longer you spend with an infected person, the greater the chance of getting infected yourself. This is not an unreasonable assumption because the spread happens through activities such as sneezing, touching, inadvertently dropping droplets of your saliva on to the other person, and so on, each of which is more likely the longer the time you spend with someone.

Some basic modelling revealed that this can be modelled as a sort of negative exponential curve that looks like this.

p = 1 - e^{-\lambda T}

T is the number of hours you spend with the other person. \lambda is a parameter of transmission – the higher it is, the more likely the disease with transmit (holding the amount of time spent together constant).

The function looks like this: 

We have no clue what \lambda is, but I’ll make an educated guess based on some limited data I’ve seen. I’ll take a conservative estimate and say that if an uninfected person spends 24 hours with an infected person, the former has a 50% chance of getting the disease from the latter.

This gives the value of \lambda to be 0.02888 per hour. We will now use this to model various scenarios.

  1. Delivery

This is the simplest model I built. There is one shop, and N customers.  Customers come one at a time and spend a fixed amount of time (1 or 2 or 5 minutes) at the shop, which has one shopkeeper. Initially, a proportion p of the population is infected, and we assume that the shopkeeper is uninfected.

And then we model the transmission – based on our \lambda = 0.02888, for a two minute interaction, the probability of transmission is 1 - e^{-\lambda T} = 1 - e^{-\frac{0.02888 * 2}{60}} ~= 0.1%.

In hindsight, I realised that this kind of a set up better describes “delivery” than a shop. With a 0.1% probability the delivery person gets infected from an infected customer during a delivery. With the same probability an infected delivery person infects a customer. The only way the disease can spread through this “shop” is for the shopkeeper / delivery person to be uninfected.

How does it play out? I simulated 10000 paths where one guy delivers to 1000 homes (maybe over the course of a week? that doesn’t matter as long as the overall infected rate in the population otherwise is constant), and spends exactly two minutes at each delivery, which is made to a single person. Let’s take a few cases, with different base cases of incidence of the disease – 0.1%, 0.2%, 0.5%, 1%, 2%, 5%, 10%, 20% and 50%.

The number of NEW people infected in each case is graphed here (we don’t care how many got the disease otherwise. We’re modelling how many got it from our “shop”). The  right side graph excludes the case of zero new infections, just to show you the scale of the problem.

Notice this – even when 50% of the population is infected, as long as the shopkeeper or delivery person is not initially infected, the chances of additional infections through 2-minute delivery are MINUSCULE. A strong case for policy-makers to enable delivery of all kinds, essential or inessential.

2. SHOP

Now, let’s complicate matters a little bit. Instead of a delivery person going to each home, let’s assume a shop. Multiple people can be in the shop at the same time, and there can be more than one shopkeeper.

Let’s use the assumptions of standard queueing theory, and assume that the inter-arrival time for customers is guided by an Exponential distribution, and the time they spend in the shop is also guided by an Exponential distribution.

At the time when customers are in the shop, any infected customer (or shopkeeper) inside can infect any other customer or shopkeeper. So if you spend 2 minutes in a shop where there is 1 infected person, our calculation above tells us that you have a 0.1% chance of being infected yourself. If there are 10 infected people in the shop and you spend 2 minutes there, this is akin to spending 20 minutes with one infected person, and you have a 1% chance of getting infected.

Let’s consider two or three scenarios here. First is the “normal” case where one customer arrives every 5 minutes, and each customer spends 10 minutes in the shop (note that the shop can “serve” multiple customers simultaneously, so the queue doesn’t blow up here). Again let’s take a total of 1000 customers (assume a 24/7 open shop), and one shopkeeper.

 

Notice that there is significant transmission of infection here, even though we started with 5% of the population being infected. On average, another 3% of the population gets infected! Open supermarkets with usual crowd can result in significant transmission.

Does keeping the shop open with some sort of social distancing (let’s see only one-fourth as many people arrive) work? So people arrive with an average gap of 20 minutes, and still spend 10 minutes in the shop. There are still 10 shopkeepers. What does it look like when we start with 5% of the people being infected?

The graph is pretty much identical so I’m not bothering to put that here!

3. Office

This scenario simulates for N people who are working together for a certain number of hours. We assume that exactly one person is infected at the beginning of the meeting. We also assume that once a person is infected, she can start infecting others in the very next minute (with our transmission probability).

How does the infection grow in this case? This is an easier simulation than the earlier one so we can run 10000 Monte Carlo paths. Let’s say we have a “meeting” with 40 people (could just be 40 people working in a small room) which lasts 4 hours. If we start with one infected person, this is how the number of infected grows over the 4 hours.

 

 

 

The spread is massive! When you have a large bunch of people in a small closed space over a significant period of time, the infection spreads rapidly among them. Even if you take a 10 person meeting over an hour, one infected person at the start can result in an average of 0.3 other people being infected by the end of the meeting.

10 persons meeting over 8 hours (a small office) with one initially infected means 3.5 others (on average) being infected by the end of the day.

Offices are dangerous places for the infection to spread. Even after the lockdown is lifted, some sort of work from home regulations need to be in place until the infection has been fully brought under control.

4. Conferences

This is another form of “meeting”, except that at each point in time, people don’t engage with the whole room, but only a handful of others. These groups form at random, changing every minute, and infection can spread only within a particular group.

Let’s take a 100 person conference with 1 initially infected person. Let’s assume it lasts 8 hours. Depending upon how many people come together at a time, the spread of the infection rapidly changes, as can be seen in the graph below.

If people talk two at a time, there’s a 63% probability that the infection doesn’t spread at all. If they talk 5 at a time, this probability is cut by half. And if people congregate 10 at a time, there’s only a 11% chance that by the end of the day the infection HASN’T propagated!

One takeaway from this is that even once offices start functioning, they need to impose social distancing measures (until the virus has been completely wiped out). All large-ish meetings by video conference. A certain proportion of workers working from home by rotation.

And I wonder what will happen to the conferences.

I’ve put my (unedited) code here. Feel free to use and play around.

Finally, you might wonder why I’ve made so many Monte Carlo Simulations. Well, as the great Matt Levine had himself said, that’s my secret sauce!

 

Distribution of political values

Through Baal on Twitter I found this “Political Compass” survey. I took it, and it said this is my “political compass”.

Now, I’m not happy with the result. I mean, I’m okay with the average value where the red dot has been put for me, and I think that represents my political leanings rather well. However, what I’m unhappy about is that my political views have been all reduced to one single average point.

I’m pretty sure that based on all the answers I gave in the survey, my political leaning across both the two directions follows a distribution, and the red dot here is only the average (mean, I guess, but could also be median) value of that distribution.

However, there are many ways in which people can have a political view that lands right on my dot – some people might have a consistent but mild political view in favour of or against a particular position. Others might have pretty extreme views – for example, some of my answers might lead you to believe that I’m an extreme right winger, and others might make me look like a Marxist (I believe I have a pretty high variance on both axes around my average value).

So what I would have liked instead from the political compass was a sort of heat map, or at least two marginal distributions, showing how I’m distributed along the two axes, rather than all my views being reduced to one average value.

A version of this is the main argument of this book I read recently called “The End Of Average“. That when we design for “the average man” or “the average customer”, and do so across several dimensions,  we end up designing for nobody, since nobody is average when looked at on many dimensions.