## Analytics and complexity

I recently learnt that a number of people think that the more variables you use in your model, the better your model is! What has surprised me is how many people think so, and that recommendations for simpler models haven’t been taken too kindly.

The conversation usually goes like this:

“so what variables have you considered for your analysis of ______ ?”
“A,B,C”
“Why don’t you consider D,E,F,… X,Y,Z also? These variables matter for these reasons. You should keep all of them and build a more complete model”
“Well I considered them but they were not significant so my model didn’t pick them up”
“No but I think your model is too simplistic if it uses only three variables”

This is a conversation I’ve had with so many people that I wonder what kind of conceptions people have about analytics. Now I wonder if this is because of the difference in the way I communicate compared to other “analytics professionals”.

When you do analytics, there are two ways to communicate – to simplify and to complicate (for lack of a better word). Based on my experience, what I find is that a majority of analytics professionals and modelers prefer to complicate – they talk about complicated statistical techniques they use for solving the problem (usually with fancy names) and bulldoze the counterparty into thinking they are indeed doing something hi-funda.

The other approach, followed by (in my opinion) a smaller number of people, is to simplify. You try and explain your model in simple terms that the counterparty will understand. So if your final model contains only three explanatory variables, you tell them that only three variables are used, and you show how each of these variables (and combinations thereof) contribute to the model. You draw analogies to models the counterparty can appreciate, and use that to explain.

Now, like analytics professionals can be divided into two kinds (as above), I think consumers of analytics can also be divided into two kinds. There are those that like to understand the model, and those that simply want to get into the insights. The former are better served by the complicating type analytics professionals, and the latter by the simplifying type. The other two combinations lead to disaster.

Like a good management consultant, I represent this problem using the following two-by-two:

As a principle, I like to explain models in a simplified fashion, so that the consumer can completely understand it and use it in a way he sees appropriate. The more pragmatic among you, however, can take a guess on what type the consumer is and tweak your communication accordingly.

## Black Box Models

A few years ago, Felix Salmon wrote an article in Wired called “The Formula That Killed Wall Street”. It was about a formula called the “Gaussian copula”, developed by the statistician David X. Li, for estimating the joint probability of a set of events happening if you knew the individual probabilities. It was a mathematical breakthrough.

Unfortunately, it fell into the hands of quants and traders who didn’t fully understand it, and they used it to derive joint probabilities of a large number of instruments put together. What they did not realize was that there was an error in the model (as there is in all models), and when they used the formula to tie up a large number of instruments, this error cascaded, resulting in an extremely inaccurate model, and subsequent massive losses (the last paragraph is based on my reading of the situation. Your mileage might vary).

In a blog post earlier this week at Reuters, Salmon returned to this article. He said:

And you can’t take technology further than its natural limits, either. It wasn’t really the Gaussian copula function which killed Wall Street, nor was it the quants who wielded it. Rather, it was the quants’ managers — the people whose limited understanding of copula functions and value-at-risk calculations allowed far too much risk to be pushed out into the tails. On Wall Street, just as in the rest of industry, a little bit of common sense can go a very long way.

I’m completely with him on this one. This blog post was in reference to Salmon’s latest article in Wired, which is about the four stages in which quants disrupt industries. You are encouraged to read both the Wired article and the blog post about it.

The essence is that it is easy to overdo analytics. Once you have a model that works in a few cases, you will end up putting too much faith in the model; soon the model becomes gospel, and you build the rest of the organization around it (this is Stage Three that Salmon talks about). For example, a friend who is a management consultant once mentioned how bank lending practices are now increasingly formula driven. He mentioned reading a manager’s report that said “I know the applicant well, and am confident that he will repay the loan. However, our scoring system ranks him too low, hence I’m unable to offer the loan“.

The key issue, as Salmon mentions in his blog post, is that managers need to have at least a basic understanding of analytics (I had touched upon this issue in an earlier blog post). As I had written in that blog post, there are two ways in which the analytics team can end up not contributing to the firm – firstly, people think they are geeks whom nobody understands, and ignore them. Secondly, and perhaps more dangerously, people think of the analytics guys as gods, and fail to challenge them sufficiently, thus putting too much faith in models.

From this perspective, it is important for the analytics team to communicate well with the other managers – to explain the basic logic behind the models, so that the managers can understand the assumptions and limitations, and can use the models in the intended manner. What usually happens, though, is that after a few attempts when management doesn’t “get” the models, the analytics people resign themselves to using technical jargon and three letter acronyms to bulldoze their models past the managers.

The point of this post, however, is about black box models. Sometimes you have people (either analytics professionals or managers) using models without fully understanding them and their assumptions. This inevitably leads to disaster. Good examples of this are the traders and quants who used David Li’s Gaussian copula and ended up with horribly wrong models.

In order to prevent this, a good practice would be for the analytics people to be able to explain the model in an intuitive fashion (without using jargon) to the managers, so that they all understand the essence and nuances of the model in question. This, of course, means that you need to employ analytics people who are capable of effectively communicating their ideas, and employ managers who are able to at least understand some basic quant.

## On finding the right signal

It is not necessary that every problem yields a “signal”. It is quite possible that you try to solve a problem using data and are simply unable to find any signal. This does not mean that you have failed in your quest – establishing the absence of a signal is itself valuable information, and needs to be appreciated.

Sometimes, however, clients and consumers of analytics fail to appreciate this. In their opinion, if you fail to find an answer to a particular problem, you as an analyst have failed in your quest. They think that with a better analyst or better analysis it is possible to get a superior signal.

This failure by consumers of analytics to appreciate that sometimes there need not be a signal can sometimes lead to fudging. Let us say you have a data set where there is a very weak signal – let us say that all your explanatory variables explain about 1% of the variance in the dependent variable. In most cases (unless you are trading – in which case a 1% signal has some value), there is little value to be gleaned from this, and you are better off without applying a model. However, the fact that the client may not appreciate you if you give “no” as an answer can lead you to propose this 1% explanatory model as truth.
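To make that 1% concrete, here is a synthetic illustration (made-up data, not from any client): a predictor that genuinely influences the outcome, but so weakly that it explains only about a hundredth of the variance.

```python
import random

random.seed(1)
n = 10_000
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.1 * xi + random.gauss(0, 1) for xi in x]   # true R^2 = 0.01/1.01, just under 1%

def r_squared(xs, ys):
    """Squared Pearson correlation: share of variance in y explained by a linear fit on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy ** 2 / (sxx * syy)

print(round(r_squared(x, y), 3))   # about 0.01: statistically real, practically useless
```

With ten thousand points the effect is unambiguously “there”, which is exactly what makes it tempting to present as an answer.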

What one needs to recognize is that a bad model can sometimes subtract value. One of my clients was using a model that had been put in place by an earlier consultant. This model prescribed certain criteria they had to follow in recruitment, and I was asked to take a look at it. What I found was that the model showed absolutely no “signal” – based on my analysis, people with a high score as per that model were no more likely to do well than those who scored low!

You might ask what the problem with such a model is. The problem is that by recommending a certain set of scores on a certain set of parameters, the model was filtering out a large number of candidates without any basis. Thus, using a poor model, the company was recruiting out of a much smaller pool, which meant less choice for the hiring managers, which in turn meant suboptimal decisions. I remember closing that case with a recommendation to dismantle the model (since it wasn’t giving much of a signal anyway) and to instead simply empower the hiring manager!

Essentially companies need to recognize two things. Firstly, not having a model is better than having a poor model, for a poor model can subtract value and lead to suboptimal decision-making. Secondly, not every problem has a quantitative solution. It is very well possible that there is absolutely no signal in the data. So if no signal exists, the analyst is not at fault if she doesn’t find a signal! In fact, she would be dishonest if she were to report a signal when none existed!

It is important that companies keep these two things in mind while hiring a consultant to solve a problem using data.

## Missed opportunities in cross-selling

Talk to any analytics or “business intelligence” provider – be it a large commoditized outsourcing firm or a rather niche consultant – and one thing they all claim to advise their clients on is strategies for “cross-sell”. However, my personal experience suggests that implementation of cross-sell strategies among retailers I encounter is extremely poor. I will illustrate this with two examples in this post.

Jet Airways and American Express together have come up with this “Jet Airways American Express Platinum Credit Card”. Like any other co-branded credit card, it offers you additional benefits on Jet Airways flights booked with this card (in terms of higher points) as well as some other benefits such as lounge access for economy travel. Given that I’m a consultant and travel frequently, this is something I think is good to have, and I have attempted to purchase it a few times. Each time, I got discouraged by the purchase process and backed out.

Now, I’m a customer of both Jet Airways and American Express. I hold an American Express Gold Card (perhaps one of the few people to have an individual AmEx card), and have a Jet Privilege account. Yet, neither Jet nor Amex seems remotely interested in selling to me. I once remember applying for this card through the Amex call centre. The person at the other end of the line wanted me to fill out the entire form once again – despite my already being a cardholder. This I would ascribe to messed-up incentive structures where the salesperson at the other end gets higher benefits for acquiring a new customer rather than upgrading an existing one. I’ve mentioned to the Amex call centre several times that I want this card, yet no one has called me back.

However, these are not the missed cross-sell opportunities I’m talking about in this post. Three times in the last three months (maybe more, but I cannot recollect) I’ve booked an air ticket to fly on Jet Airways from the Jet Airways website, having logged into my Jet Privilege account and paid with my American Express card. Each time I’ve waited hopefully for some system at either the Jet or the Amex end to make the connection and offer me this Platinum card, but so far there has been no response. It is perhaps the case that for some reason they do not want to upgrade existing customers to this card (in which case the entire discussion is moot), but not offering me a card here is simply a blatant missed opportunity – in cricketing terms you can think of this as an easy dropped catch.

The other case has to do with banking. I’m in the process of purchasing a house, and over the last few months have been transferring large amounts of money to the seller in order to make my down payments (which I’m meeting through my savings). Now, I’ve had my account with Citibank for over seven years and have never withdrawn such large amounts – except maybe to make some fixed deposits. One time, I got a call from the bank’s call centre, confirming that it was indeed I who had made the transfer. Why did the bank not think of finding out (in a discreet manner) why all of a sudden so much money had moved out of my account, whether I was looking to purchase something, and whether the bank could help? Of course, during a later visit to the Citibank local branch I found I wouldn’t have got a loan from them anyway, since they don’t finance apartments built by no-name builders that are still under construction (which fits the bill of the property I’m purchasing). Nevertheless – the large sums transferred out of my account could have been for buying a property that the bank could have financed. Missed opportunity there?

My understanding of the situation is that in several “analytics” offerings there is a disconnect between the tech and the business sides. Somewhere along the chain of implementation there is one hand-off where one party knows only the business aspects and the other knows only technology, and thus the two are unable to converse, leading to suboptimal decisions. One kind of value I offer (hint! hint!!) is that I understand both tech and business, and I can ensure a much smoother hand-off between the technical and business aspects, thus leading to superior solution design.

## Nate Silver Interview At HBR Blogs

HBR Blogs has interviewed Nate Silver on analytics, building a career in analytics and how organizations should manage analytics. I agree with his views on pretty much everything. Some money quotes:

HBR: Say an organization brings in a bunch of ‘stat heads’ to use your terminology. Do you silo them in their own department that serves the rest of the company? Or is it important to make sure that every team has someone who has the analytic toolkit to pair with expertise?

Silver: I think you want to integrate it as much as possible. That means that they’re going to have some business skills, too, right? And learn that presenting their work is important. But you need it to be integrated into the fabric of the organization.

And this:

Silver: If you can’t present your ideas to at least a modestly larger audience, then it’s not going to do you very much good. Einstein supposedly said that I don’t trust any physics theory that can’t be explained to a 10-year-old. A lot of times the intuitions behind things aren’t really all that complicated. In Moneyball that on-base percentage is better than batting average looks like ‘OK, well, the goal is to score runs. The first step in scoring runs is getting on base, so let’s have a statistic that measures getting on base instead of just one type of getting on base.’ Not that hard a battle to fight.

And this:

Silver: A lot of times when data isn’t very reliable, intuition isn’t very reliable either. The problem is people see it as an either/or, when it sometimes is both or neither, as well. The question should be how good is a model relative to our spitball, gut-feel approach.

Go on and read the whole interview.

## Hedgehogs and foxes: Or, a day in the life of a quant

I must state at the outset that this post is inspired by the second chapter of Nate Silver’s book The Signal and the Noise. In that chapter, which is about election forecasting, Silver draws upon the old Russian parable of the hedgehog and the fox. According to that story, the fox knows several tricks while the hedgehog knows only one – curling up into a ball. The story ends in favour of the hedgehog, as none of the tricks of the unfocused fox can help him evade the predator.

Most political pundits, says Silver, are like hedgehogs. They have just one central idea to their punditry and they tend to analyze all issues through that. A good political forecaster, however, needs to be able to accept and process any new data inputs, and include them in his analysis. With just one technique, this can be hard to achieve, and so Silver says that to be a good political forecaster one needs to be a fox. While this might lead to some contradictory statements and thus bad punditry, it leads to good forecasts. Anyway, you can read more about election forecasting in Silver’s book.

The world of “quant” and “analytics” which I inhabit is again similarly full of hedgehogs. You have the statisticians, whose solution for every problem is a statistical model. They can wax eloquent about Log Likelihood Estimators but can have trouble explaining why you should use that in the first place. Then you have the banking quants (I used to be one of those), who are proficient in derivatives pricing, stochastic calculus and partial differential equations, but if you ask them why a stock price movement is generally assumed to be lognormal, they don’t have answers. Then you have the coders, who can hack, scrape and write really efficient code, but don’t know much math. And mathematicians who can come up with elegant solutions but who are divorced from reality.

While you might make a career out of falling under any of the above categories, to truly unleash your potential as a quant, you should be able to do all of it. You should be a fox and should know each of these tricks. And unlike the fox in the old Russian parable, the key to being a good fox is to know which trick to use when. Let me illustrate this with an example from my work today (actual problem statement masked since it involves client information).

So there were two possible distributions that a particular data point could have come from and I had to try and analyze which of them it came from (simple Bayesian probability, you might think). However, calculating the probability wasn’t so straightforward, as it wasn’t a standard function. Then I figured I could solve the probability problem using the inclusion-exclusion principle (maths again), and wrote down a mathematical formulation for it.

Now, I was dealing with a rather large data set, so I had to use a computer, and I turned my mathematical solution into pseudo-code. Then I realized that the pseudo-code was recursive, and given the size of the problem I would soon run out of memory, so I had to reformulate the solution using dynamic programming. Then, following some more code optimization, I had the probability. And then I had to go back and do the Bayesian analysis in order to complete the solution, and finally present it in the form of a “business solution”, with all the above mathematical jugglery abstracted away from the client.
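The actual problem is masked, but the recursion-to-DP step can be illustrated with a standard stand-in (my choice of example, not the client problem): computing the distribution of the number of “successes” among independent events with known individual probabilities. A naive recursion branches twice per event, giving a 2^n call tree; the bottom-up version below fills a small table instead.

```python
def count_distribution(p):
    """dist[j] = P(exactly j of the independent events with probabilities p occur)."""
    dist = [1.0]  # before seeing any event, the count is 0 with certainty
    for pi in p:
        new = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            new[j] += (1 - pi) * q   # this event fails: count unchanged
            new[j + 1] += pi * q     # this event occurs: count goes up by one
        dist = new
    return dist

dist = count_distribution([0.1, 0.4, 0.35, 0.8, 0.25])
print(round(sum(dist), 6))   # 1.0 -- the distribution is complete
print(round(dist[0], 5))     # 0.05265 -- product of the five failure probabilities
```

This runs in O(n·k) time and O(k) space instead of the exponential blow-up of the raw recursion; the Bayesian step then just reads the probabilities it needs off the table.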

This versatility can come in handy in other places, too. There was a problem for which I figured out that the appropriate solution involved building a classification tree. However, given the nature of the data at hand, none of the off-the-shelf classification tree algorithms were ideal. So I simply went ahead and wrote my own code for creating such trees. Then, I figured that classification trees are built by a greedy algorithm, which can get stuck at local optima. And so I added a simulated annealing aspect to it.
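The tree code itself is client work, but the greedy-versus-annealing point can be shown on a toy landscape (all the numbers below are made up). Greedy descent stops at the first dip it finds; annealing sometimes accepts a worse neighbour, with a probability that shrinks as the “temperature” cools, and so can climb out.

```python
import math
import random

# Toy landscape: index = a candidate configuration, value = its cost.
# Greedy descent from index 0 stalls at the local optimum (cost 3);
# annealing escapes it and reaches the global optimum (cost 0).
costs = [5, 3, 4, 1, 4, 0, 4, 2]

def neighbours(x):
    return [n for n in (x - 1, x + 1) if 0 <= n < len(costs)]

def greedy(start):
    x = start
    while True:
        best = min(neighbours(x), key=lambda n: costs[n])
        if costs[best] >= costs[x]:
            return x
        x = best

def anneal(start, steps=5000, t0=3.0):
    random.seed(42)
    x = best = start
    for i in range(steps):
        t = t0 * (1 - i / steps) + 1e-9           # linear cooling schedule
        n = random.choice(neighbours(x))
        delta = costs[n] - costs[x]
        if delta <= 0 or random.random() < math.exp(-delta / t):
            x = n                                  # accept (possibly uphill) move
        if costs[x] < costs[best]:
            best = x
    return best

print(costs[greedy(0)])   # 3: stuck in the local dip
print(costs[anneal(0)])   # 0: the global optimum
```

The same idea applies to trees: perturb a split, accept the occasional worse tree early on, and cool down as you converge.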

While I may not have in depth knowledge of any of the above techniques (to gain breadth you have to sacrifice depth), that I’m aware of a wide variety of techniques means I can provide the solution that is best for the problem at hand. And as I go along, I hope to keep learning more and more techniques – even if I don’t use them, being aware of them will lead to better overall problem solving.

## Why standard deviation is not a good measure of volatility

Most finance textbooks, at least the ones that are popular in Business Schools, use standard deviation as a measure of volatility of a stock price. In this post, we will examine why it is not a great idea. To put it in one line, the use of standard deviation loses information on the ordering of the price movement.

As earlier, let us look at two data sets and try to measure their volatility. Let us consider two time series (let’s simply call them “series1” and “series2”) and try and compare their volatilities. The table here shows the two series:

What can you say of the two series now? You think they are similar? You might notice that both contain the same set of numbers, but jumbled up. Let us look at the volatility as expressed by standard deviation. Unsurprisingly, since both series contain the same set of numbers, the volatility of both series is identical – at 8.655.

However, does this mean that the two series are equally volatile? Not particularly, as you can see from this graph of the two series:

It is clear from the graph (if it was not clear from the table already) that Series 2 is much more volatile than series 1. So how can we measure it? Most textbooks on quantitative finance (as opposed to textbooks on finance) use “Quadratic Variation” as a measure of volatility. How do we measure quadratic variation?

If we have a series of numbers from $a_1$ to $a_n$, then the quadratic variation of this series is measured as

$\sum_{i=2}^{n} (a_i - a_{i-1})^2$

Notice that the primary feature of the quadratic variation is that it takes the sequence into account. So when you have something like series 2, with alternating positive and negative jumps, that gets captured in the quadratic variation. So what would the quadratic variation values be for the two time series we have here?

The QV of series 1 is 29 while that of series 2 is a whopping 6119, which is probably a fair indicator of their relative volatilities.
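A quick sanity check in code. The table and graph images are not reproduced here, so the series below are my reconstruction: the numbers 1 to 30 (whose population standard deviation is the 8.655 quoted above), once in order and once interleaved low-high. The post’s exact jumbling, and hence its QV of 6119, isn’t recoverable from the text, but any such shuffle makes the same point.

```python
import statistics

series1 = list(range(1, 31))                     # 1, 2, 3, ..., 30
series2 = [x for pair in zip(range(1, 16), range(30, 15, -1)) for x in pair]
# series2 = 1, 30, 2, 29, ... : the same numbers, jumbled

def quadratic_variation(series):
    return sum((b - a) ** 2 for a, b in zip(series, series[1:]))

# identical standard deviations ...
print(round(statistics.pstdev(series1), 3))      # 8.655
print(round(statistics.pstdev(series2), 3))      # 8.655
# ... wildly different quadratic variations
print(quadratic_variation(series1))              # 29
print(quadratic_variation(series2))              # 8555
```

Standard deviation is blind to ordering by construction; quadratic variation is built out of the successive differences, so the jumbling shows up immediately.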

So why standard deviation?

Now you might ask why textbooks use standard deviation at all then, if it misses out so much of the variation. The answer, not surprisingly, lies in quantitative finance. When the price of a stock (X) is governed by a Wiener process, or

$dX = \sigma \, dW$

then the quadratic variation of the stock price (between time 0 and time t) can be shown to be $\sigma^2 t$, which for $t = 1$ is $\sigma^2$, the variance of the process.

Because for a particular kind of process, which is commonly used to model stock price movement, the quadratic variation is equal to variance, variance is commonly used as a substitute for quadratic variation as a measure of volatility.

However, considering that in practice stock prices are seldom Brownian (either arithmetic or geometric), this equivalence doesn’t necessarily hold.

This is also a point that Benoit Mandelbrot makes in his excellent book The (mis)Behaviour of Markets. He calls this the Joseph effect (after the biblical Joseph, who interpreted Pharaoh’s dream of seven fat cows being eaten by seven lean cows, and predicted that seven years of plenty would be followed by seven years of famine). Financial analysts, by using a simple variance (or standard deviation) to characterize volatility, miss out on such serial effects.

## Anscombe’s Quartet and Analytics

Many “analytics professionals” or “quants” I know or have worked with have no hesitation in diving straight into a statistical model when they are faced with a problem, rather than trying to understand the data. However, that is not the way I work. Whenever I set out solving a new problem, I start with spending considerable time trying to get a feel of the data. There are many things I do to “feel” the data – look at a few lines of data, look at descriptive statistics of some of the variables and distributions of individual variables. The most powerful tool, however, that lets me get a feel for data is the humble scatterplot.

The beauty of the scatter plot is that it allows you to get a real feel for the data. Taking variables two at a time, it not only shows you how each of them is distributed but also how they are related to each other. Relationships that are not apparent when you look at the data become apparent when you graph them. I may not be wrong in saying that the scatterplot defines the direction and scope of your entire solution.

The problem is that the prevailing practice of analytics is loaded against this approach. A large majority of people who use statistics in their daily work dive straight into analysis without looking at the data. Perhaps they deem looking at data a waste of time? I have even seen pitch decks by extremely reputed software companies that propose solutions such as “we will solve this problem using Logistic Regression” without even having seen the data.

Let us take an example now. Take the following four data sets (my apologies for putting an image here):

Let us say you dive straight into the analysis. Like a good “analytics professional” you dive straight into regression. You may even do some descriptive statistics for each of the data sets along the way. And this is what you find (again, apologies for the image):

Do you conclude that the four data sets are the same? Pretty much identical statistics right? I wouldn’t be surprised if you were to publish that there is nothing to differentiate between these four data sets. Now, let us do a simple scatter plot of each of these data sets and check for ourselves:

Now, do you still think these data sets are identical? And now do you see why I stress so much on getting a feel for the data, and on drawing the humble scatter plot?

The data set I’ve used here is a rather famous one, called Anscombe’s Quartet. It was constructed precisely to illustrate the point I’m making in this post: that one needs to get a feel for the data before diving into the analysis. Draw scatter plots for every pair of variables. Understand the relationships, and let this understanding guide your further analysis. If every piece of data could be perfectly analyzed by diving straight into a regression, the job of analytics might as well be outsourced to computers.
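Since the tables above are only images, here is the quartet itself (the data as published by Anscombe in 1973), along with a minimal check of the near-identical summary statistics:

```python
import statistics

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    # every set: mean(x) = 9, mean(y) ~ 7.50, var(x) = 11, var(y) ~ 4.12
    print(name, statistics.mean(x), round(statistics.mean(y), 2),
          statistics.variance(x), round(statistics.variance(y), 2))
```

Identical to two decimal places on every descriptive statistic a lazy analysis would compute; only a plot (or a residual check) tells the four apart.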

PS: It is a tragedy that when they teach visualization in school they don’t even mention the scatter plot. At a recent workshop I asked the participants to name the different kinds of graphs they knew. “Line”, “Bar” and “Pie” were the most common answers. Not one answered “scatter plot”. Given the utility of this simple plot, this is indeed tragic.

## Exponential increase

“Increasing exponentially” is a common phrase used by analysts and commentators to describe a certain kind of growth. However, more often than not, this phrase is misused and any fast growth is termed as “exponential”.

If f(x) is a function of x, f(x) is said to be exponential if and only if it can be written in the form:

$f(x) = K \alpha^x$

So if your growth is quick but linear, then it is not exponential. While on the topic, I want to point you to two posts I’ve written on my public policy blog for Takshashila: one on exponential growth in bank transfers that I wrote earlier today and another on exponential growth in toilet ownership. The former shows you how to detect exponential growth and in the latter I use what I think is an innovative model to model toilet growth as exponential.
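The detection trick is worth spelling out: if $f(x) = K \alpha^x$ then $\log f(x) = \log K + x \log \alpha$, a straight line. So regress $\log y$ on $x$ and look at the fit. A minimal sketch on made-up data:

```python
import math

xs = list(range(10))
exponential = [3 * 1.5 ** x for x in xs]    # f(x) = K * alpha^x
quadratic = [3 * x ** 2 + 1 for x in xs]    # fast growth, but not exponential

def log_fit_r2(xs, ys):
    """R^2 of a straight-line fit of log(y) against x."""
    logy = [math.log(y) for y in ys]
    n = len(xs)
    mx, my = sum(xs) / n, sum(logy) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, logy))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in logy)
    return sxy ** 2 / (sxx * syy)

print(round(log_fit_r2(xs, exponential), 4))  # 1.0: log-linear, hence exponential
print(round(log_fit_r2(xs, quadratic), 4))    # noticeably below 1
```

With real, noisy data you would look at how close R² is to 1 and whether the residuals show curvature, rather than expect an exact 1.0.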

## Should you have an analytics team?

In an earlier post, I had talked about the importance of business people knowing numbers and numbers people knowing business, and had put in a small advertisement for my consulting services by mentioning that I know both business and numbers and work at their cusp. In this post, I take that further and analyze if it makes sense to have a dedicated analytics team.

Following the data boom, most companies have decided (rightly) that they need to do something to take advantage of all the data they have, and have created dedicated analytics teams. These teams, normally staffed with people from a quantitative or statistical background, with perhaps a few MBAs, are in charge of all the data the company has, along with doing some rudimentary analysis. The question is whether having such dedicated teams is effective, or whether it is better to have numbers-enabled people across the firm.

Having an analytics team makes sense from the point of view of economies of scale. People who are conversant with numbers are hard to come by, and when you find some, it makes sense to put them together and get them to work exclusively on numerical problems. That also ensures collaboration and knowledge sharing and that can have positive externalities.

Then, there is the data aspect. Anyone doing business analytics within a firm needs access to data from all over the firm, and if the firm doesn’t have a centralized data warehouse which houses all its data, one task of each analytics person would be to get together the data that they need for their analysis. Here again, the economies of scale of having an integrated analytics team work. The job of putting together data from multiple parts of the firm is not solved multiple times, and thus the analysts can spend more time on analyzing rather than collecting data.

So far so good. However, in a post a while back I had explained that investment banks’ policies of having exclusive quant teams have doomed them to long-term failure. My contention there (including an insider view) was that an exclusive quant team, whose only job is to model and which doesn’t have a view of the market, can quickly get insular, and this can lead to groupthink. People are more likely to solve for problems as defined by their models than for problems posed by the market. This, I had mentioned, can soon lead to a disconnect between the bank’s models and the markets, and ultimately to trading losses.

Extending that argument, it works the same way with non-banking firms as well. When you put together a group of numbers people and call them the analytics group, and only give them the job of building models rather than looking at actual business issues, they are likely to get similarly insular and opaque. While initially they might do well, soon they start getting disconnected from the actual business the firm is doing, and soon fall in love with their models. Soon, like the quants at big investment banks, they too will start solving for their models rather than for the actual business, and that prevents the rest of the firm from getting the best out of them.

Then there is the jargon. If you say “I fitted a multinomial logistic regression and it gave me a p-value of 0.05 so this model is correct”, the business manager without much clue of numbers can be bulldozed into submission. By talking a language which most of the firm does not understand, you obscure yourself, and this leads to one of two responses. Either they deem the analytics team incapable (since it fails to talk the language of business, in which case the purpose of the analytics team’s existence may be lost), or they assume the analytics team to be fundamentally superior (thanks to the obscurity of its language), in which case there is the risk of incorrect and possibly inappropriate models being adopted.

I can think of several solutions for this, but irrespective of what solution you ultimately adopt – whether you go completely centralized, completely distributed, or a hybrid of the two – the key step in getting the best out of your analytics is to have your senior and senior-middle management conversant with numbers. By that I don’t mean that they should all go for a course in statistics. What I mean is that your middle and senior management should know how to solve problems using numbers. When they see data, they should have the ability to ask the right kind of questions. Irrespective of how the analytics team is placed, as long as you ask them the right kind of questions, you are likely to benefit from their work (assuming basic levels of competence, of course). This way, they can remain conversant with the analytics people, and a middle ground can be established so that insights from numbers can actually flow into business.

So here is the plug for this post – shortly I’ll be launching short (1-day) workshops in analytics for middle and senior level managers. Keep watching this space.