Getting caught in rather heavy early morning traffic while on my way to a meeting today made me think of the concept of correlation. This was driven by the fact that I noticed a higher proportion of cars than usual this morning. It had rained early this morning, and more people were taking out their cars as a precautionary measure, I reasoned.
Assume you are the facilities manager at a company which is going to move to a new campus. You need to decide how many parking slots to purchase at the new location. You know that all your employees possess both a two wheeler and a car, and use either to travel to work. Car parking space is much more expensive than two wheeler parking space, so you want to optimize on costs. How will you decide how many parking spaces to purchase?
You will correctly reason that not everyone brings their car every day. For a variety of reasons, people might choose to travel to work by scooter. You decide to use data to make your decision on parking space. For three months, you go down to the basement (of the old campus) every day, count the number of cars, and diligently tabulate the counts. At the end of the three months, you calculate that on a typical (median) day, thirty people bring their cars to work. You also calculate that on ninety five percent of the days there were forty or fewer cars in the basement, and that on no occasion did the total number of cars in the basement cross forty five.
So you decide to purchase forty car parking spaces in the new facility. It is not the same set of people who bring their cars to work every day. In fact, each employee has brought his/her car to the workplace at least once in the last three months. What you are betting on here, however, is the absence of correlation. You assume that the reason Alice brings her car to office is not related to the reason Bob brings his car to office. To put it statistically, you assume that Alice bringing her car and Bob bringing his car are independent events. Whether Alice brings her car or not has no bearing on Bob’s decision to bring his car, and vice versa. And you know that even on the odd day when more than forty people bring their cars, there are not more than forty five cars, and you can somehow “adjust” with your neighbours to borrow the additional slots for that day. You get a certificate from the CEO for optimizing on the cost of parking space.
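To see why forty slots looks safe under the independence assumption, here is a minimal sketch of the arithmetic. The headcount of 100 and the 30% chance of driving on any given day are my own illustrative assumptions (picked so that the typical day sees thirty cars); they are not figures from the story.

```python
from math import comb

def binom_tail(n, p, k):
    """P(X > k) when X ~ Binomial(n, p), i.e. n independent yes/no decisions."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1, n + 1))

# Hypothetical inputs, chosen only for illustration.
N_EMPLOYEES = 100   # assumed headcount
P_CAR = 0.30        # assumed chance that any one employee drives on a given day

p_over_40 = binom_tail(N_EMPLOYEES, P_CAR, 40)
p_over_45 = binom_tail(N_EMPLOYEES, P_CAR, 45)

print(f"P(more than 40 cars) = {p_over_40:.4f}")
print(f"P(more than 45 cars) = {p_over_45:.6f}")
```

With these inputs both probabilities come out small (well under five percent), which is why forty slots plus a few borrowed ones feels like a safe bet, as long as everyone really does decide independently.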
And then one rainy morning things go horribly wrong. Your phone doesn’t stop ringing. Angry staffers are calling you, complaining that they have no place to park. Given the heavy rains that morning, none of the staffers wanted to risk getting wet, and all of them decided to bring their cars. They have never faced a problem parking before, so they are all confident that there will be no problem once they get to work, only to realize there is not enough parking space. Over a hundred employees have driven to work, and there are only forty slots to park.
The problem here, as you might discover, is that of correlation. You had assumed that Alice’s decision to bring her car was uncorrelated with Bob’s decision. What you had not accounted for was the possibility that an exogenous event could suddenly drive the correlation from zero to one, thus upsetting all your calculations!
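A small simulation makes the point concrete. This is only a sketch under made-up assumptions: 120 employees, a 25% chance of driving on a dry day, a 95% chance on a rainy day, and rain on 2% of days. None of these numbers come from the story; the function name `simulate_cars` is likewise just for illustration.

```python
import random

def simulate_cars(n_employees, p_dry, p_wet, p_rain, n_days, seed=0):
    """Daily car counts when one shared event (rain) shifts everyone's odds at once."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_days):
        raining = rng.random() < p_rain      # the common, exogenous shock
        p = p_wet if raining else p_dry      # same shift for every employee
        counts.append(sum(rng.random() < p for _ in range(n_employees)))
    return counts

# All parameters below are hypothetical.
counts = sorted(simulate_cars(n_employees=120, p_dry=0.25, p_wet=0.95,
                              p_rain=0.02, n_days=5000))

median = counts[len(counts) // 2]
p95 = counts[int(0.95 * len(counts))]
worst = counts[-1]

print("median day:", median)
print("95th percentile day:", p95)
print("worst day:", worst)
```

The median and the 95th percentile barely notice the rainy days, because those days are rare. The worst day, however, is nowhere near the forty-odd cars that the quiet days would lead you to expect.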
This is analogous to what happened during the Financial Crisis of 2008. Normally, Alice defaulting on her home loan is not correlated with Bob defaulting on his. So you take a thousand such loans, all seemingly uncorrelated with each other and put them in a bundle, assuming that 99% of the time not more than five loans will default. You then slice this bundle into tranches, get some of them rated AAA, and sell them on to investors (and keep some for yourself). All this while, you have assumed that the loans are uncorrelated. In fact, the independence was a key assumption in your expectation of the number of loans that will default and in your highest tranche getting a AAA rating.
Now, for reasons beyond your control and understanding, house prices drop. Soon it starts to make sense for home owners to willfully default on their loans, since the value of the debt now exceeds the value of their home. With one such exogenous event, correlations suddenly rise. Fifty loans in your pool of a thousand default (a 1 in a gazillion event according to your calculations that assumed zero correlation). Your AAA tranche is forced to pay out less than full value. The lower tranches get wiped out. This and a thousand similar bundles of loans set off what ultimately became the Financial Crisis of 2008.
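Here is a rough sketch of how much the independence assumption matters for the loan pool. The per-loan default probability of 0.18% is my assumption, picked so that under independence roughly 99% of the time no more than five loans in a thousand default. The 2% chance of a house price shock and the 6% default probability in that state are purely hypothetical.

```python
from math import exp, lgamma, log

def binom_pmf(n, p, k):
    """P(X = k) for X ~ Binomial(n, p), computed in log space to avoid overflow."""
    log_pmf = (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
               + k * log(p) + (n - k) * log(1 - p))
    return exp(log_pmf)

def tail_at_least(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(n, p, i) for i in range(k, n + 1))

N_LOANS = 1000
P_NORMAL = 0.0018   # assumed per-loan default probability in normal times
P_STRESS = 0.06     # hypothetical default probability after house prices drop
P_SHOCK = 0.02      # hypothetical chance that the house price drop happens

# Every loan independent, always in "normal times".
independent = tail_at_least(N_LOANS, P_NORMAL, 50)

# Same loans, but one shared shock can raise every loan's default odds together.
with_shock = ((1 - P_SHOCK) * independent
              + P_SHOCK * tail_at_least(N_LOANS, P_STRESS, 50))

print(f"P(50+ defaults), independent: {independent:.3e}")
print(f"P(50+ defaults), with shock:  {with_shock:.3e}")
```

Under independence the chance of fifty defaults is vanishingly small. Allow for even a small chance of a shared shock and the same event becomes something you should expect to see every so often.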
The point of this post is that you need to be careful about the correlations you assume. Sometimes an exogenous event can upset your calculations of correlations. And when you go wrong with your correlations, especially those among a large number of variables, you can get hurt real bad.
I’ll leave you with a thought: assuming you live in a primarily two wheeler city (like Bangalore, where I live), what will happen to the traffic on a day when 10% more people than usual get out their cars?