The Significance of Statistical Significance

Last year, an aunt was diagnosed with extremely low bone density. She had been complaining of back pain and weakness, and a few tests later, her orthopedist confirmed that bone density was the problem. She was put on a course of medication, and was then given shots. A year later, she got her bone density tested again, and found that there was not much improvement.

She did a few rounds of the doctors again (orthopedists, endocrinologists and the like), and the first few were puzzled that the medication and the shots had had no effect. One of the doctors, though, saw something the others didn’t: “there is no marked improvement, for sure”, he remarked, “but there is definitely some improvement”.

Let us say you take ten thousand observations in “state A”, and another ten thousand in “state B”. The average of your observations in state A is 100, and the standard deviation is 10. The average of your observations in state B is 101, and the standard deviation is 10. Is there a significant difference between the observations in the two states?

Statistically speaking, there most definitely is. With 10000 samples and a standard deviation of 10, the “standard error” of each average is 0.1 (10 / sqrt(10000)). The two averages are ten standard errors apart, and even measured against the standard error of the difference between them (about 0.14, since both averages are uncertain) the gap is about seven standard errors wide. This means that the difference between them is “statistically significant” to a high degree of significance. The question, however, is whether the difference is actually “significant” (in the non-statistical sense of the word).
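The arithmetic here can be sketched in a few lines of Python. This is only an illustration under assumptions of my own: the two sets of observations are independent, the averages are roughly normally distributed, and the function names are made up for this example. Note that it compares the gap against the standard error of the difference between two averages, so the gap comes out at about seven of those rather than ten standard errors of a single average.

```python
import math

def standard_error(sd, n):
    """Standard error of an average of n observations with standard deviation sd."""
    return sd / math.sqrt(n)

def z_for_difference(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Gap between two averages, in units of the standard error of their difference.

    Assumes the two samples are independent, so their variances add.
    """
    se_diff = math.sqrt(standard_error(sd_a, n_a) ** 2 + standard_error(sd_b, n_b) ** 2)
    return (mean_b - mean_a) / se_diff

def two_sided_p(z):
    """Two-sided p-value under a normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))

# Numbers from the example: state A (100, sd 10) vs state B (101, sd 10), 10000 each.
se = standard_error(10, 10_000)
z = z_for_difference(100, 10, 10_000, 101, 10, 10_000)
p = two_sided_p(z)

print(f"standard error of each average: {se:.3f}")
print(f"z for the difference: {z:.2f}")
print(f"two-sided p-value: {p:.2e}")
```

The p-value is tiny, which is all that “statistically significant” means here. It says nothing about whether a one-point change matters.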

Think about it from the context of drug testing. Let us say that we are testing a drug for increasing bone density among people with low bone density (like my aunt). Let’s say we catch 10000 mice and measure their bone densities. Let’s say the average is 100, with a standard deviation of 10.

Now, let us inject our drug (in the appropriate dosage, scaled down from man to mouse) into our mice, and after they’ve undergone the requisite treatment, measure the bone densities again. Let’s say that the average is now 101, with a standard deviation of 10. Based on this test, can we conclude that our drug is effective for improving bone density?

What cannot be denied is that one course of medication among mice produces results that are statistically significant: there is an increase in bone density among mice that cannot be explained by randomness alone. From this perspective, the drug is undoubtedly effective; it is extremely likely that taking the drug has a positive effect.

However, does this mean that we use this drug for treating low bone density? Despite the statistical significance, the answer to this is not very clear. Let us for a moment assume that there are no competitors: there is no other known drug which can increase a patient’s bone density by a statistically significant amount. So the choice is this: we either use no drug at all, leading to no improvement in the patient (let us assume that another experiment has shown that in the absence of drugging, there is no change in bone density), or we use this drug, which produces a small but statistically significant improvement. What do we do?

The question we need to answer here is whether the magnitude of improvement on account of taking this drug is worth the cost (monetary cost, possible side effects, etc.) of taking the drug. Do we want to put the patient through the trouble of taking the medication when we know that the difference it will make, though statistically significant, is marginal? It is a fuzzy question, and doesn’t necessarily have a clear answer.

In summary, the basic point is that a statistically significant improvement does not mean that the difference is significant in terms of magnitude. With samples large enough, even small changes can be statistically significant, and we need to be cognizant of that.
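To see how sample size alone drives statistical significance, here is a rough Python sketch that holds the improvement fixed at one point (100 to 101, standard deviation 10) and varies only the number of observations. The sample sizes other than 10000 are my own illustrative choices, and I use Cohen's d (the difference divided by the standard deviation) as one common measure of magnitude; it is not the only one.

```python
import math

def cohens_d(mean_a, mean_b, sd):
    """Difference between averages in units of standard deviation (a magnitude measure)."""
    return (mean_b - mean_a) / sd

def z_statistic(mean_a, mean_b, sd, n):
    """z for the difference of two independent averages, n observations each, same sd."""
    return (mean_b - mean_a) / (sd * math.sqrt(2.0 / n))

sizes = [10, 100, 1_000, 10_000]  # illustrative sample sizes per group
d = cohens_d(100, 101, 10)        # does not depend on sample size
rows = [(n, z_statistic(100, 101, 10, n)) for n in sizes]

for n, z in rows:
    verdict = "significant" if abs(z) > 1.96 else "not significant"
    print(f"n={n:>6}  d={d:.2f}  z={z:5.2f}  {verdict} at the 5% level")
```

The magnitude of the effect stays the same in every row. Only the verdict on significance changes as the sample grows.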

Postscript
No mice were harmed in the course of writing this blog post.

Standard Error in Survey Statistics

Over the last week or more, one of the topics of discussion in the pink papers has been the employment statistics that were recently published by the NSSO. Mint, which first carried the story, has now started a whole series on it, titled “The Great Jobs Debate” where people from both sides of the fence have been using the paper to argue their case as to why the data makes or doesn’t make sense.

The story started when Mint Editor and Columnist Anil Padmanabhan (who, along with Aditya Sinha (now at DNA) and Aditi Phadnis (of Business Standard), ranks among my favourite political commentators in India) pointed out that the number of jobs created during the first UPA government (2004-09) was about 1 million, which is far less than the number of jobs created during the preceding NDA government (~60 million). This has led to a hue and cry from all sections: leftists say that jobless growth is because of too many reforms, rightists say we aren’t creating jobs because we haven’t had enough reform, and some others say there’s something wrong with the data. Chief Statistician TCA Anant, in his column published in the paper, tried to use some obscurities in the sub-levels of the survey to point out why the data makes sense.

In today’s column, Niranjan Rajadhyaksha points out that the way employment is counted in India is very different from the way it is in developed countries. In the latter, employers give statistics of their payroll to the statistics collection agency periodically. However, due to the presence of the large unorganized sector, this is not possible in India, so we resort to “surveys”, for which the NSSO is the primary organization.

In a survey, to estimate a quantity across a large population, we simply take a much smaller sample, which is small enough for us to rigorously measure this quantity. Then, we try to extrapolate the results to the whole population. The key thing in a survey is the “standard error”, which is a measure of how far the “observed statistic” is likely to be from the “true statistic”. What intrigues me is that there is absolutely no mention of the standard error in any of the communication about this NSSO survey (again, I’m relying on the papers here; I haven’t seen the primary data).

Typically, when we measure something by means of a survey, the “true value” is expressed in terms of the “95% confidence interval”. What we say is “with 95% probability, the true value of XXXX lies between Xmin and Xmax”. An alternative representation is “we think the value of XXXX is centred at Xmid with a standard error of Xse”. So in order to communicate numbers computed from a survey, it is necessary to give out two numbers. So what is the NSSO doing by reporting just one number (most likely the mid)?
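The two representations carry the same information, and converting between them is simple. The sketch below assumes the usual normal approximation, under which a 95% range is the mid plus or minus 1.96 standard errors. The function names and the numbers (a mid of 50 with a standard error of 2) are made up purely for illustration and have nothing to do with the NSSO data.

```python
def interval_from_se(mid, se, z=1.96):
    """Return (Xmin, Xmax) for a given Xmid and Xse, assuming a normal approximation."""
    return mid - z * se, mid + z * se

def se_from_interval(x_min, x_max, z=1.96):
    """Recover (Xmid, Xse) from a symmetric range built the same way."""
    return (x_min + x_max) / 2, (x_max - x_min) / (2 * z)

# Hypothetical survey result: centred at 50 with a standard error of 2.
x_min, x_max = interval_from_se(mid=50.0, se=2.0)
x_mid, x_se = se_from_interval(x_min, x_max)

print(f"95% range: {x_min:.2f} to {x_max:.2f}")
print(f"recovered mid and standard error: {x_mid:.2f}, {x_se:.2f}")
```

Either way, a reader needs two numbers. The mid alone cannot be turned into a range.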

Samples used by NSSO are usually very small. At least, they are very small compared to the overall population, which makes the standard error very large. Could it be that the standard error is not reported because it’s so large that the mean doesn’t make sense? And if the standard error is so large, why should we even use this data as a basis to formulate policy?

So here’s my verdict: the “estimated mean” of the employment as of 2009 is not very different from the “estimated mean” of the employment as of 2004. However, given that the sample sizes are small, the standard error will be large. So it is very possible that the true mean of employment as of 2009 is actually much higher than the true mean of 2004 (by the same argument, it could be the other way round, which points at something more grave). So I conclude that given the data we have here (assuming standard errors aren’t available), we have insufficient data to conclude anything about the job creation during the UPA1 government, and its policy implications.
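To make this argument concrete, here is a small Python sketch of how two survey estimates would be compared if their standard errors were known. Every number in it is invented for illustration (these are not NSSO figures), the function name is my own, and it assumes the two surveys are independent so that their variances add.

```python
import math

def difference_interval(est_old, se_old, est_new, se_new, z=1.96):
    """Estimated change between two independent surveys, with a 95% range around it."""
    diff = est_new - est_old
    se_diff = math.sqrt(se_old ** 2 + se_new ** 2)
    return diff, diff - z * se_diff, diff + z * se_diff

# Invented numbers: two nearly equal estimates, each with a large standard error.
diff, low, high = difference_interval(est_old=100.0, se_old=3.0, est_new=100.2, se_new=3.0)

# The change is only conclusive if the whole range lies on one side of zero.
conclusive = low > 0 or high < 0

print(f"estimated change: {diff:.1f}, 95% range: {low:.1f} to {high:.1f}")
print("conclusive" if conclusive else "cannot tell whether the true value rose or fell")
```

With standard errors this large, the range around the change comfortably includes both a large rise and a large fall, which is exactly why the standard errors need to be reported.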