Sigma and normal distributions

I’m on my way to the Bangalore airport now, north of the Hebbal flyover. It’s raining like crazy again today – the second time in a week it’s rained this badly.

I instinctively thought “today is an N sigma day in terms of rain in Bangalore” (where N is a large number). Then I immediately realized that such a statement would make sense only if rainfall in Bangalore were to follow a normal distribution!

When people say something is an N sigma event, what they’re really trying to convey is that it is a very improbable event, and N is a measure of this improbability. The relationship between N and the implied improbability is given by the shape of the normal curve.

However, when a quantity follows a distribution other than the normal, the relationship between the mean, the standard deviation (sigma), and the implied probability breaks down, and the number of sigmas means something totally different in terms of implied improbability.

It is good practice, thus, to stop talking in terms of sigmas and to talk in terms of odds instead. It’s better to say “a one in forty event” than “a two sigma event” (I’m assuming a one-tailed normal distribution here).
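As a rough sketch of this translation (assuming a standard normal and a one-tailed probability, as above), the sigma-to-odds conversion needs nothing more than the error function from the standard library:

```python
from math import erfc, sqrt

def one_tailed_odds(n_sigma):
    """One-tailed tail probability P(Z > n_sigma) for a standard
    normal, i.e. the chance of exceeding n_sigma on one side."""
    return 0.5 * erfc(n_sigma / sqrt(2))

for n in (1, 2, 3):
    p = one_tailed_odds(n)
    print(f"{n} sigma ~ a one in {1 / p:.0f} event")
```

A two-sigma event comes out to roughly a one-in-44 event, close to the “one in forty” figure above; three sigma is already about one in 741.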

The broader point is that the normal distribution is so ingrained in people’s minds that they assume all quantities follow a normal distribution – which is dangerous and needs to be strongly discouraged.

In this direction, any small measure – like talking in terms of odds rather than in terms of sigmas – will go a long way!

5 thoughts on “Sigma and normal distributions”

  1. In non-normal distributions, I think you can use the inter-quartile distance as an approximation of sigma. So, if you are in the below-median space, median – Q1 (25th percentile) can be used. If you are in the above-median space, then Q3 (75th percentile) – median can be used. Now multiply that value by 2 or 3 or 5 etc. to determine your realistic or “normal” space and “outlier” space. So instead of calling something 3 sigma or 5 sigma etc., you can say 3*(Q3 – median) or 3*(median – Q1) for the same applications. You can give it a name with fewer syllables than inter-quartile distance so that it can replace sigma. Maybe call it sigmann (sigma-non-normal). 🙂

  2. One thing “N sigma” guarantees is that the probability is at most 1/N^2. So for any distribution, an N-sigma event with N = 3 has a probability of 1/9 or lower. This comes from Chebyshev’s inequality. For a normal distribution, the probability (0.27%) is much smaller than this theoretical limit.

    1. Thanks for reminding me about Chebyshev’s inequality – I’d forgotten about it. But it’s a very weak bound, as you mention!

  3. Makes sense. This made me go back to a post you had on your other work-related blog titled ‘The Significance of Statistical Significance’. I went back to the example of the two data sets you had given there (one with mean 100 and another with mean 101, with the same std dev and a standard error of 0.1 each) to check the significance.

    The way you would approach that would be like this:

    A 95% CI for the difference of means can be calculated using (X1bar – X2bar) +- 1.96*sqrt(SE1^2 + SE2^2)
    1.96 because that is the critical value for a 95% CI.

    So for the example you provided, it would be:

    (101 – 100) +- 1.96*sqrt(0.1^2 + 0.1^2) = 1 +- 0.28 = (0.72, 1.28)
    And since the null hypothesis, which is X1bar – X2bar = 0, lies outside this range, you can reject the null hypothesis, and hence the difference between the two sets of data is statistically significant.

    In this case you have considered a technique of ‘difference of the means’. However, for a case like ‘before medication and after medication’, the better approach would be to consider the ‘mean of the differences’: subtract each value in the first set of observations from the corresponding value in the second (for all 10000 observations) and calculate the mean and std dev of the resulting set.

    Please let me know your thoughts.
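The Chebyshev point raised in the comments can be checked numerically. A minimal sketch, using a standard normal for the comparison: Chebyshev’s inequality bounds the two-tailed probability P(|X − μ| ≥ Nσ) by 1/N² for any distribution with finite variance, while the exact normal figure is far smaller.

```python
from math import erfc, sqrt

def chebyshev_bound(n_sigma):
    """Chebyshev's inequality: P(|X - mu| >= n*sigma) <= 1/n^2,
    valid for any distribution with finite variance."""
    return 1.0 / n_sigma ** 2

def normal_two_tailed(n_sigma):
    """Exact two-tailed probability P(|Z| > n_sigma) for a standard normal."""
    return erfc(n_sigma / sqrt(2))

for n in (2, 3, 5):
    print(f"{n} sigma: Chebyshev bound {chebyshev_bound(n):.4f}, "
          f"normal {normal_two_tailed(n):.6f}")
```

At three sigma the universal bound (1/9, about 11%) is indeed much weaker than the normal-distribution figure of 0.27% quoted above.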
