Covid-19 superspreaders in Karnataka

Through a combination of luck and competence, my home state of Karnataka has handled the Covid-19 crisis rather well. While the total number of cases detected in the state edged past 2000 recently, the number of locally transmitted cases detected each day has hovered in the 20-25 range.

Perhaps the low case volume means that Karnataka is able to give out data at a level of detail that few other states in India are providing. For each case, the rationale for testing the patient (usually the source from which they caught the disease) is given. This data comes out in two daily updates through the @dhfwka Twitter handle.

Research that came out recently showed that the spread of Covid-19 follows a classic power law, with a low value of “alpha”. Basically, most infected people don’t infect anyone else, but a handful of infected people infect lots of others.

The Karnataka data, put out by @dhfwka and meticulously collected and organised by the folks at (they frequently drive me mad by suddenly changing the API or moving data into a new file, but overall they’ve been doing stellar work), has sufficient information to see if this sort of power law holds.

For every patient who was tested because they were a contact of an already infected patient, the “notes” field of the data contains the latter patient’s ID. This way, we are able to build a sort of graph of who got the disease from whom (some people got the disease “from a containment zone”, or out of state, and they are all ignored in this analysis).
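Here is a minimal sketch of how such a graph might be built. The record layout and the exact wording of the notes field (“Contact of P101”) are assumptions for illustration, not the actual format of the data, and the function names are hypothetical.

```python
import re
from collections import Counter

# Assumed note format: "Contact of P101". The real notes field may be
# worded differently; adjust the pattern accordingly.
SOURCE_ID = re.compile(r"P(\d+)")


def build_transmission_edges(records):
    """Return (source_id, patient_id) pairs for every patient whose note
    names an earlier patient as the contact."""
    edges = []
    for patient_id, note in records:
        if not note.lower().startswith("contact of"):
            continue  # containment zone, travel history etc. are ignored
        match = SOURCE_ID.search(note)  # first ID mentioned is taken as the source
        if match:
            edges.append((int(match.group(1)), patient_id))
    return edges


def count_transmissions(edges):
    """Number of detected cases attributed to each source patient."""
    return Counter(source for source, _ in edges)


# Made-up records, purely illustrative
records = [
    (101, "Travel history to Delhi"),
    (102, "Contact of P101"),
    (103, "Contact of P101"),
    (104, "Containment zone"),
    (105, "Contact of P102"),
]
edges = build_transmission_edges(records)
counts = count_transmissions(edges)
print(counts.most_common())
```

Taking only the first ID in a note keeps every case attached to exactly one source, so the number of edges equals the number of cases with an identified transmitter.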

From this graph, we can approximate how many people each infected person transmitted the infection to. Here are the “top” people in Karnataka, who transmitted the disease to the most people.

Patient 653, a 34-year-old male from Karnataka, who got infected from patient 420, passed on the disease to 45 others. Patient 419 passed it on to 34 others. And so on.

Overall in Karnataka, based on the data as of tonight, there have been 732 cases where the source (person) of infection has been clearly identified. These 732 cases have been transmitted by 205 people. Just two of the 205 (less than 1%) are responsible for 79 people (11% of all cases where the transmitter has been identified) getting infected.

The top 10 “spreaders” in Karnataka are responsible for infecting 260 people, or 36% of all cases where transmission is known. The top 20 spreaders in the state (10% of all spreaders) are responsible for 48% of all cases. The top 41 spreaders (20% of all spreaders) are responsible for 61% of all transmitted cases.

Now you might think this is not as steep as the “well-known” Pareto distribution (80-20 distribution), but here the 20% is only of the “spreaders”, not of everyone infected. Our analysis ignores the 1000-odd people who were found to have the disease at least one week ago, and none of whose contacts have been found to have the disease.

I admit this graph is a little difficult to understand, but basically I’ve ordered people found positive for Covid-19 in Karnataka by the number of people they’ve passed on the infection to, and graphed how many people they’ve cumulatively infected. It is a very clear Pareto curve.

The exact exponent of the power law depends on what you take as the denominator (number of people who could have infected others, having themselves been infected), but the shape of the curve is not in question.
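The cumulative curve, and the effect of the denominator on it, can be sketched as follows. The counts here are invented for illustration and are not the Karnataka data; the function names and the `n_non_spreaders` parameter are my own.

```python
def top_share(counts, k):
    """Share of all identified transmissions due to the k biggest spreaders."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


def cumulative_share(counts, n_non_spreaders=0):
    """Points on the cumulative curve, biggest spreaders first.

    counts: number of people each spreader infected (any order).
    n_non_spreaders: infected people with no known onward transmission,
        added to the denominator as zeros.
    Returns a list of (fraction_of_people, fraction_of_transmissions).
    """
    ordered = sorted(counts, reverse=True) + [0] * n_non_spreaders
    total = sum(ordered)
    if total == 0:
        return []
    curve = []
    running = 0
    for rank, infected in enumerate(ordered, start=1):
        running += infected
        curve.append((rank / len(ordered), running / total))
    return curve


# Illustrative counts only (they sum to 100)
counts = [45, 34, 10, 5, 3, 1, 1, 1]

print(top_share(counts, 2))
print(cumulative_share(counts)[1])                      # top 2 of 8 spreaders
print(cumulative_share(counts, n_non_spreaders=12)[1])  # top 2 of 20 infected
```

Padding the list with zeros leaves the share of transmissions unchanged but shrinks the share of people responsible for it. On the figures quoted above, adding the 1000-odd non-spreaders to the 205 spreaders would make the top 41 only about 3% of everyone infected.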

Essentially, the Karnataka data validates some research that’s recently come out: most of the disease spread stems from a handful of super spreaders. A very large proportion of people who are infected don’t pass it on to any of their contacts.

Anscombe’s Quartet and Analytics

Many “analytics professionals” or “quants” I know or have worked with have no hesitation in diving straight into a statistical model when they are faced with a problem, rather than trying to understand the data. However, that is not the way I work. Whenever I set out to solve a new problem, I start by spending considerable time trying to get a feel for the data. There are many things I do to “feel” the data: look at a few lines of data, look at descriptive statistics of some of the variables, and look at distributions of individual variables. The most powerful tool that lets me get a feel for data, however, is the humble scatterplot.

The beauty of the scatter plot is that it allows you to get a real feel for the data. Taking variables two at a time, it not only shows you how each of them is distributed but also how they are related to each other. Relationships that are not apparent when you look at the data become apparent when you graph them. I may not be wrong in saying that the scatterplot defines the direction and scope of your entire solution.

The problem with the debate on how analytics needs to be done is that it is loaded. A large majority of people who use statistics in their daily work dive straight into analysis without looking at the data. Perhaps they deem that looking at data is a waste of time? I have even seen pitch decks by extremely reputed software companies that propose solutions such as “we will solve this problem using Logistic Regression” without even having seen the data.

Let us take an example now. Take the following four data sets (my apologies for putting an image here):


Let us say you dive straight into the analysis. Like a good “analytics professional” you dive straight into regression. You may even do some descriptive statistics for each of the data sets along the way. And this is what you find (again, apologies for the image)
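For readers who want to reproduce the numbers, here is a sketch using the values of the quartet as originally published by Anscombe. The helper `summarize` is hand-rolled for illustration and computes an ordinary least-squares line, which is my assumption about what “regression” means here.

```python
from statistics import mean, variance

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by sets I, II and III
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(x, y):
    """Descriptive statistics plus the least-squares line y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": variance(x),  # sample variance
        "var_y": variance(y),
        "r": sxy / (sxx * syy) ** 0.5,  # Pearson correlation
        "slope": slope,
        "intercept": my - slope * mx,
    }


SUMMARY = {name: summarize(x, y) for name, (x, y) in QUARTET.items()}
for name, stats in SUMMARY.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

All four sets come out with a mean of x of 9, a mean of y of about 7.5, a correlation of about 0.82 and a fitted line of roughly y = 3 + 0.5x.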


Do you conclude that the four data sets are the same? Pretty much identical statistics, right? I wouldn’t be surprised if you were to publish that there is nothing to differentiate between these four data sets. Now, let us do a simple scatter plot of each of these data sets and check for ourselves:


Now, do you still think these data sets are identical? Do you see why I stress so much on getting a feel for the data and drawing the humble scatter plot?

The data set I’ve used here is a rather famous one, and it is called Anscombe’s Quartet. The purpose of the data set is precisely to illustrate what I have described in this post: that one needs to get a feel for the data before diving into the analysis. Draw scatter plots for every pair of variables. Understand the relationships, and let this understanding guide your further analysis. If one were able to perfectly analyze every piece of data by diving straight into a regression, the job of analytics might as well be outsourced to computers.
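Drawing a scatter plot for every pair of variables is easy to automate. This is a rough sketch, assuming matplotlib is installed; the function names, the dict-of-lists input and the output file name are all my own choices for illustration.

```python
from itertools import combinations


def variable_pairs(columns):
    """All unordered pairs of column names, in a stable order."""
    return list(combinations(columns, 2))


def scatter_every_pair(data, path="pairs.png"):
    """Save one scatter plot per pair of variables to a single image.

    data: dict mapping column name -> list of numbers (all the same length).
    """
    pairs = variable_pairs(list(data))
    if not pairs:
        raise ValueError("need at least two variables to draw a scatter plot")

    # Imported here so the pairing helper works without matplotlib
    import matplotlib
    matplotlib.use("Agg")  # render to a file, no display needed
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, len(pairs), figsize=(4 * len(pairs), 4), squeeze=False)
    for ax, (x_name, y_name) in zip(axes[0], pairs):
        ax.scatter(data[x_name], data[y_name])
        ax.set_xlabel(x_name)
        ax.set_ylabel(y_name)
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path


print(variable_pairs(["price", "volume", "returns"]))
```

Calling `scatter_every_pair({"price": [...], "volume": [...], "returns": [...]})` would write three plots side by side. If your data is already in pandas, `pandas.plotting.scatter_matrix` does much the same job in one call.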

PS: it is a tragedy that when they teach visualization in school they don’t even mention the scatter plot. At a recent workshop I asked the participants to name the different kinds of graphs they knew. “Line”, “Bar” and “Pie” were the most common answers. Not one answered “scatter plot”. Given the utility of this simple plot, this is indeed tragic.