Stirring the pile efficiently

Warning: This is a technical post, and involves some code, etc. 

As I’ve ranted a fair bit on this blog over the last year, a lot of “machine learning” in the industry can be described as “stirring the pile”. Regular readers of this blog will be familiar with this image from XKCD by now:

Source: https://xkcd.com/1838/

Basically, people simply take datasets and apply all the machine learning techniques they have heard of. Implementation is damn easy: scikit-learn allows you to implement just about any model in three similar-looking lines of code (see my code here to see how similar the implementations are).
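To make the "three similar-looking lines" point concrete, here is a minimal sketch. It is not the notebook code; the toy data, the variable names and the choice of three models are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy data (an assumption for illustration): two standard-normal predictors
# and a binary target that is true on the right half of the plane.
rng = np.random.default_rng(0)
features = rng.standard_normal((200, 2))
target = features[:, 0] > 0

predictions = {}
for name, model_class in [
    ("logistic", LogisticRegression),
    ("svm_rbf", SVC),                 # SVC uses an RBF kernel by default
    ("tree", DecisionTreeClassifier),
]:
    model = model_class()                        # line 1: create the model
    model.fit(features, target)                  # line 2: fit it
    predictions[name] = model.predict(features)  # line 3: predict (in-sample)

for name, predicted in predictions.items():
    print(name, "in-sample accuracy:", (predicted == target).mean())
```

Swapping one technique for another only changes the class name, which is exactly what makes pile-stirring so tempting.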

So I thought I’d help these pile-stirrers by giving some hints on which method to use for different kinds of data. I’ve over-simplified stuff, and so assume that:

  1. There are two predictor variables X and Y. The predicted variable “Z” is binary.
  2. X and Y are each drawn from a standard normal distribution.
  3. The predicted variable Z is “clean” – there is a region in the X-Y plane where Z is always “true” and another region where Z is always “false”
  4. So the idea is to see which machine learning techniques are good at identifying which kind of geometrical figures.
  5. Everything is done “in-sample”. Given the nature of the data, it doesn’t matter if we do it in-sample or out-of-sample.
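The setup above can be sketched in a few lines of NumPy. The exact shapes, sizes and sample count used in the notebook are not given in this post, so the regions below (and names such as `true_regions`) are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(42)
n_points = 5000  # sample size is an assumption

# Two predictors, each drawn from a standard normal distribution.
x = rng.standard_normal(n_points)
y = rng.standard_normal(n_points)

# "Clean" binary targets: Z is true exactly inside a geometric region.
# These particular shapes are illustrative guesses, not the original ones.
true_regions = {
    "circle": x**2 + y**2 < 1.0,
    "square": (np.abs(x) < 1.0) & (np.abs(y) < 1.0),
    "triangle": (y > -1.0) & (y < 1.0 - 2.0 * np.abs(x)),
    "hyperbola": x * y > 0.5,  # two disjoint "true" regions
}

for name, z in true_regions.items():
    print(f"{name:10s} share of true points: {z.mean():.2f}")
```

Each entry of `true_regions` is a boolean array that can be passed straight to a classifier as the target.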

For those that understand Python (and every pile-stirrer worth his salt is excellent at Python), I’ve put my code in a nice Jupyter Notebook, which can be found here.

So this is what the output looks like. The top row shows the “true values” of Z. Then we have a row for each of the techniques we’ve used, which shows how well these techniques can identify the pattern given in the top row.

As you can see, I’ve chosen some common geometrical shapes and seen which methods are good at identifying those. A few pertinent observations:

  1. Logistic regression and linear SVM are broadly similar, and both are shit for this kind of dataset. Being linear models, they fail to deal with non-linear patterns.
  2. SVM with RBF kernel is better, but it fails when there are multiple “true regions” in the dataset. At least it’s good at figuring out some non-linear patterns. However, it can’t figure out the triangle or square – it draws curves around them, instead.
  3. Naive Bayes (I’ve never understood this even though I’m pretty good at Bayesian statistics, but I understand this is a commonly used technique; and I’ve used default parameters, so I’m not sure how it is even “Bayesian”) can identify some stuff but does badly when there are disjoint regions where Z is true.
  4. Ensemble methods such as Random Forests and Gradient Boosting do rather well on all the given inputs, for both polygons and curves. AdaBoost mostly does well too, but trips up on the hyperbola.
  5. For some reason, Lasso fails to give an output (in the true spirit of pile-stirring, I didn’t explore why). Ridge is again a regression method and so does badly on this non-linear dataset.
  6. Neural Networks (Multi Layer Perceptron to be precise) do reasonably well, but can’t figure out the sharp edges of the polygons.
  7. Decision trees again do rather well. I’m pleasantly surprised that they pick up and classify the disjoint sets (multi-circle and hyperbola) correctly. Maybe it’s the way scikit-learn implements them?
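The contrast between the linear models and the trees can be reproduced with a small sketch. It assumes a unit circle as the “true” region and default model settings, which may differ from what the notebook used. One thing worth keeping in mind while reading the numbers: a decision tree grown without a depth limit can carve out every training point, so an in-sample score flatters it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
points = rng.standard_normal((2000, 2))

# Hypothetical "true" region: the inside of the unit circle.
inside_circle = (points**2).sum(axis=1) < 1.0

# A linear model can only draw one straight boundary through the plane.
logit = LogisticRegression().fit(points, inside_circle)
logit_accuracy = logit.score(points, inside_circle)

# A tree approximates the circle with many axis-aligned cuts.
tree = DecisionTreeClassifier(random_state=0).fit(points, inside_circle)
tree_accuracy = tree.score(points, inside_circle)

print(f"logistic regression, in-sample accuracy: {logit_accuracy:.2f}")
print(f"decision tree, in-sample accuracy:       {tree_accuracy:.2f}")
print("tree depth needed:", tree.get_depth())
```

Scoring both models on freshly drawn points instead of `points` would be a fairer way to check how well the tree has actually learnt the shape.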

Of course, the datasets that one comes across in real life are never such simple geometrical figures, but I hope that this set can give you some idea of what techniques to use where.

At least I hope that this makes you think about the suitability of different techniques for the data rather than simply applying all the techniques you know and then picking the one that performs best on your given training and test data.

That would count as nothing different from p-hacking, and there’s an XKCD for that as well!

Source: https://xkcd.com/882/

Astrology and Data Science

The discussion goes back some 6 years, when I’d first started setting up my data and management consultancy practice. Since I’d freshly quit my job to set up the said practice, I had plenty of time on my hands, and the wife suggested that I spend some of that time learning astrology.

Considering that I’ve never been remotely religious or superstitious, I found this suggestion preposterous (I had a funny upbringing in the matter of religion – my mother was insanely religious (including following a certain Baba), and my father was insanely rationalist, and I kept getting pulled in both directions).

Now, the wife has some (indirect) background in astrology. One of her aunts is an astrologer, and specialises in something called “prashNa shaastra”, where the prediction is made based on the time at which the client asks the astrologer a question. My wife believes this has resulted in largely correct predictions (though I suspect a strong dose of confirmation bias there), and (very strangely to me) seems to believe in the stuff.

“What’s the use of studying astrology if I don’t believe in it one bit”, I asked. “Astrology is very mathematical, and you are very good at mathematics. So you’ll enjoy it a lot”, she countered, sidestepping the question.

We went off into a long discussion on the origins of astrology, and how it resulted in early developments in astronomy (necessary in order to precisely determine the position of planets), and so on. The discussion got intricate and took many digressions, as discussions of this sort tend to. And as you might expect with such discussions, my wife threw a curveball: “You know, you say you’re building a business based on data analysis. Isn’t data analysis just like astrology?”

I was stumped (ok I know I’m mixing metaphors here), and that ended the discussion then.

Until I decided to bring it up recently. As it turns out, once again (after a brief hiatus when I decided I’ll do a job) I’m in the process of setting up a data and management consulting business. The difference is this time I’m in London, and that “data science” is a thing (it wasn’t in 2011). And over the last year or so I’ve been kinda disappointed to see what goes on in the name of “data science” around me.

This XKCD cartoon (which I’ve shared here several times) encapsulates it very well. People literally “pour data into a machine learning system” and then “stir the pile” hoping for the results.

Source: https://xkcd.com/1838/

In the process of applying fairly complex “machine learning” algorithms, I’ve seen people not really bother about whether the analysis makes intuitive sense, or if there is “physical meaning” in what the analysis says, or if the correlations actually imply causation. It’s blind application of “run the data through a bunch of scikit-learn models and accept the output”.

And this is exactly how astrology works. There are a bunch of predictor variables (position of different “planets” in various parts of the “sky”). There is the observed variable (whether some disaster happened or not, basically), which is nicely in binary format. And then some of our ancients did some data analysis on this, trying to identify combinations of predictors that predicted the output (unfortunately they didn’t have the power of statistics or computers, so in that sense the models were limited). And then they simply accepted the outputs, without challenging why it makes sense that the position of Jupiter at the time of your wedding affects how your marriage will go.

So I brought up the topic of astrology and data science again recently, saying “OK after careful analysis I admit that astrology is the oldest form of data science”. “That’s not what I said”, the wife countered. “I said that data science is new age astrology, and not the other way round”.

It’s hard to argue with that!

Profit and politics

Earlier today I came across this article about data scientists on LinkedIn that I agreed with so much that I started wondering if it was simply a case of confirmation bias.

A few sentences (possibly taken out of context) from there that I agree with:

  • Many large companies have fallen into the trap that you need a PhD to do data science, you don’t.
  • There are some smart people who know a lot about a very narrow field, but data science is a very broad discipline. When these PhD’s are put in charge, they quickly find they are out of their league.
  • Often companies put a strong technical person in charge when they really need a strong business person in charge.
  • I always found the academic world more political than the corporate world and when your drive is profits and customer satisfaction, that academic mindset is more of a liability than an asset.

Back to the topic, which is the last of these sentences. This is something I’ve intended to write for 5-6 years now, since the time I started off as an independent management consultant.

During the early days I took on assignments from both for-profit and not-for-profit organisations, and soon it was very clear that I enjoyed working with for-profit organisations a lot more. It wasn’t about money – I was fairly careful in my negotiations to never underprice myself. It was more to do with processes, and interactions.

The thing in for-profit companies is that objectives are clear. While not everyone in the company has an incentive to increase the bottom-line, it is not hard to understand what they want based on what they do.

For example, in most cases a sales manager optimises for maximum sales. Financial controllers want to keep a check on costs. And so on. So as part of a consulting assignment, it’s rather easy to know who wants what, and how you should pitch your solution to different people in order to get buy-in.

With a not-for-profit it’s not that clear. While each person may have their own metrics and objectives, because the organisation is not for profit, these objectives and metrics need not be everything they’re optimising for.

Moreover, in the not-for-profit world, the lack of money or profit as an objective means you cannot differentiate yourself with efficiency or quantity. Take the example of an organisation which, for whatever reason, gets to advise a ministry on a particular subject, and does so without a fee or only for a nominal fee.

How can a competitor who possibly has a better solution to the same problem “displace” the original organisation? In the business world, this can be done by showing superior metrics and efficiency and offering to do the job at a lower cost and stuff like that. In the not-for-profit setup, you can’t differentiate on things like cost or efficiency, so the only thing you can do is to somehow provide your services in parallel and hope that the client gets it.

And then there is access. If you’re a not-for-profit consultant who has a juicy project, it is in your interest to become a gatekeeper and prevent other potential consultants from getting the same kind of access you have – for you never know if someone else who might get access through you might end up elbowing you out.

The (missing) Desk Quants of Main Street

A long time ago, I’d written about my experience as a Quant at an investment bank, and about how banks like mine were sitting on a pile of risk that could blow up any time soon.

There were two problems as I had documented then. Firstly, most quants I interacted with seemed to be solving maths problems rather than finance problems, not bothering if their models would stand the test of markets. Secondly, there was an element of groupthink, as quant teams were largely homogeneous and it was hard to progress while holding contrarian views.

Six years on, there has been no blowup, and in some sense banks are actually doing well (I mean, they’ve declined compared to the time just before the 2008 financial crisis but haven’t done that badly). There have been no real quant disasters (yes I know the Gaussian Copula gained infamy during the 2008 crisis, but I’m talking about a period after that crisis).

There can be many explanations regarding how banks have not had any quant blow-ups despite quants solving maths problems and all thinking alike, but the one I’m partial to is the presence of a “middle layer”.

Most of the quants I interacted with were “core” in the sense that they were not attached to any sales or trading desks. Banks also typically had a large cadre of “desk quants” who are directly associated with trading teams, and who build models and help with day-to-day risk management, pricing, etc.

Since these desk quants work closely with the business, they turn out to be much more pragmatic than the core quants – they have a good understanding of the market and use the models more as guiding principles than as rules. On the other hand, they bring the benefits of quantitative models (and work of the core quants) into day-to-day business.

Back during the financial crisis, I’d jokingly predicted that other industries should hire quants who were now surplus to Wall Street. Around the same time, DJ Patil et al came up with the concept of the “data scientist” and called it the “sexiest job of the 21st century”.

And so other industries started getting their own share of quants, or “data scientists” as they were now called. Nowadays it’s fashionable even for small companies for whom data is not critical for business to have a data science team. Being in this profession now (I loathe calling myself a “data scientist” – prefer to say “quant” or “analytics”), I’ve come across quite a few of those.

The problem I see with “data science” on “Main Street” (this phrase gained currency during the financial crisis as the opposite of Wall Street, in that it referred to “normal” businesses) is that it lacks the cadre of desk quants. Most data scientists are highly technical people who don’t necessarily have an understanding of the business they operate in.

Thanks to that, what I’ve noticed is that in most cases there is a chasm between the data scientists and the business, since they are unable to talk in a common language. As I’m prone to saying, this can go two ways – the business guys can either assume that the data science guys are geniuses and take their word for the gospel, or the business guys can totally disregard the data scientists as people who do some esoteric math and don’t really understand the world. In either case, value added is suboptimal.

It is not hard to understand why “Main Street” doesn’t have a cadre of desk quants – it’s because of the way the data science industry has evolved. Quant work at investment banks has evolved over a long period of time – the Black-Scholes equation was proposed in the early 1970s. So the quants were first recruited to directly work with the traders, and core quants (at the banks that have them) were a later addition when banks realised that some quant functions could be centralised.

On the other hand, the whole “data science” growth has been rather sudden. The volume of data, cheap incrementally available cloud storage, easy processing and the popularity of the phrase “data science” have all grown at an ever faster rate in the last decade or so, and so companies have scrambled to set up data teams. There has simply been no time to train people who get both the business and data – and the data scientists exist like addendums that are either worshipped or ignored.

Maths, machine learning, brute force and elegance

Back when I was at the International Maths Olympiad Training Camp in Mumbai in 1999, the biggest insult one could hurl at a peer was to describe the latter’s solution to a problem as being a “brute force solution”. Brute force solutions, which were often ungainly, laboured and unintuitive, were supposed to be the last resort, to be used only if one were thoroughly unable to implement an “elegant solution” to the problem.

Mathematicians love and value elegance. While they might be comfortable with esoteric formulae and the Greek alphabet, they are always on the lookout for solutions that are, at least to the trained eye, intuitive to perceive and understand. Among other things, it is the belief that it is much easier to get an intuitive understanding for an elegant solution.

When all the parts of the solution seem to fit so well into each other, with no loose ends, it is far easier to accept the solution as being correct (even if you don’t understand it fully). Brute force solutions, on the other hand, inevitably leave loose ends and appreciating them can be a fairly massive task, even to trained mathematicians.

In the conventional view, though, non-mathematicians don’t have much fondness for elegance. A solution is a solution, and a problem solved is a problem solved.

With the coming of big data and increased computational power, however, the tables are getting turned. In this case, the more mathematical people, who are more likely to appreciate “machine learning” algorithms, recommend “leaving it to the system” – to unleash the brute force of computational power at the problem so that the “best model” can be found, and later implemented.

And in this case, it is the “half-blood mathematicians” like me, who are aware of complex algorithms but are unsure of letting the system take over stuff end-to-end, who bat for elegance – to look at data, massage it, analyse it and then find that one simple method or transformation that can throw immense light on the problem, effectively solving it!

The world moves in strange ways.

Newsletter!

So after much deliberation and procrastination, I’ve finally started a newsletter. I call it “the art of data science” and the title should be self-explanatory. It’s pure unbridled opinion (the kind of which usually goes on this blog), except that I only write about one topic there.

I intend to have three sections and then a “chart of the edition” (note how cleverly I’ve named this section to avoid giving much away on the frequency of the newsletter!). This edition, though, I ended up putting too much harikathe, so I restricted myself to two sections before the chart.

I intend to talk a bit each edition about some philosophical part of dealing with data (this section got a miss this time), a bit on data analysis methods (I went a bit meta on this one this time) and a little bit on programming languages (which I used for bitching a bit).

And that I plan to put a “chart of the edition” means I need to read newspapers a lot more, since you are much more likely to find gems (in either direction) there than elsewhere. For the first edition, I picked off a good graph I’d seen on Twitter, and it’s about Hull City!

Anyway, enough of this meta-harikathe. You can read the first edition of the newsletter here. In case you want to get it in your inbox each week/fortnight/whenever I decide to write it, then subscribe here!

And put feedback (by email, not comments here) on what you think of the newsletter!

High dimension and low dimension data science

I’ve observed that there are two broad approaches that people take to getting information out of data. One approach is to simply throw a kitchen sink full of analytical techniques at the data. Without really trying to understand what the data looks like, and what the relationships may be, the analyst simply uses one method after another to try and get insight from the data. Along the way, a “model” will get built.

The other approach (which I’m partial to) involves understanding each variable, and the relationships between variables, as a first step to getting insight from the data. Here, too, a model might get built, but it will be conditional on the analyst’s view on what kind of a model might suit the data after looking at the data.

Considering that both these approaches are used by large numbers of analysts, it is highly likely that both are legitimate. Then what explains the fact that some analysts use one approach, and others use another? Having thought about it for a long time, I have a hypothesis – it depends on the kind of data being analysed. More precisely, it has to do with the dimensionality of the data.

The first approach (which one might classify as “machine learning”) works well when the data is of high dimensions, where the number of variables that can be used as predictors is really large, of the order of thousands or larger. For example, even a seemingly low-resolution 32 by 32 pixel image, looked at as a data point, has 1024 dimensions (the colour of the image at each of the 1024 pixels is a different dimension).
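The arithmetic behind the 1024 is simply 32 × 32. As a small sketch (the pixel values here are random placeholders, and a greyscale image is assumed), flattening an image turns it into a single row with one column per pixel:

```python
import numpy as np

rng = np.random.default_rng(0)

# A placeholder 32x32 greyscale image with pixel values from 0 to 255.
image = rng.integers(0, 256, size=(32, 32))

# Looked at as one data point, every pixel becomes its own predictor.
feature_vector = image.reshape(-1)
print(feature_vector.shape)

# A dataset of 100 such images is a table with 100 rows and 1024 columns.
dataset = rng.integers(0, 256, size=(100, 32, 32)).reshape(100, -1)
print(dataset.shape)
```

With three colour channels per pixel the same image would have 32 × 32 × 3 = 3072 dimensions.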

Moreover, in such situations, it is likely that the signal in the data doesn’t come from one, or two, or a handful of predictors. In high dimension data science, the signal usually comes from complex interplay of data along various dimensions. And this kind of search is not something humans are fit for – it is best that the machines are left to “learn” the model by themselves, and so you get machine learning.

On the other hand, when the dimensionality of the dataset is low, it is possible (and “easy”) for an analyst to look at the interplay of factors in detail, and understand the data before going on to build the model. Doing so can help the analyst identify patterns in the data that may not be that apparent to a machine, and it is also likely that in such datasets, the signal lies with data along a small number of dimensions, where relatively simple manipulation will suffice. The low dimensionality also means that complex machine learning techniques are unlikely to contribute much in such cases.
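With only a handful of variables, “looking at the data” can be as simple as the sketch below: a summary of each variable, then the pairwise correlations. The three variables and the relationship between them are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n_rows = 1000

# Hypothetical low-dimension dataset: three variables, one simple relationship.
price = rng.normal(100.0, 10.0, n_rows)
footfall = rng.normal(500.0, 50.0, n_rows)
sales = 2000.0 - 8.0 * price + 1.5 * footfall + rng.normal(0.0, 20.0, n_rows)

columns = {"price": price, "footfall": footfall, "sales": sales}

# Step 1: understand each variable on its own.
for name, values in columns.items():
    print(f"{name:9s} mean={values.mean():8.1f} sd={values.std():6.1f} "
          f"min={values.min():8.1f} max={values.max():8.1f}")

# Step 2: understand the relationships between variables.
correlations = np.corrcoef(np.vstack(list(columns.values())))
print(np.round(correlations, 2))
```

With three columns there are only three pairs to inspect, which is why this approach is feasible here and hopeless with thousands of predictors.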

As you might expect, from an organisational perspective, the solution is quite simple – to deploy high-dimension data scientists on high-dimension problems, and likewise with low-dimension data scientists. Since this distinction between high-dimension and low-dimension data scientists isn’t very well known, it’s quite possible that the scientists might be made to work on a problem of dimensionality that is outside of their comfort zone.

So when you have low-dimension data scientists faced with a large number of dimensions of data, you will see them use brute force to try and find signals in bivariate relationships in the data – an approach that will never work since the signal lies in a more complex interplay of dimensions.
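A two-variable toy case shows how a signal can hide from bivariate checks. In this sketch (the data and the XOR-style target are assumptions chosen to make the point), neither predictor is correlated with the outcome on its own, yet the two together determine it exactly:

```python
import numpy as np

rng = np.random.default_rng(3)
n_points = 10000
x = rng.standard_normal(n_points)
y = rng.standard_normal(n_points)

# The outcome is true when exactly one of the two predictors is positive.
z = (x > 0) ^ (y > 0)

# Bivariate view: each predictor against the outcome, one at a time.
corr_xz = np.corrcoef(x, z.astype(float))[0, 1]
corr_yz = np.corrcoef(y, z.astype(float))[0, 1]
print(f"corr(x, z) = {corr_xz:+.3f}")
print(f"corr(y, z) = {corr_yz:+.3f}")

# Joint view: the sign of the product x*y recovers the outcome.
agreement = ((x * y < 0) == z).mean()
print(f"share of points explained by the interaction: {agreement:.3f}")
```

Scale this up from two predictors to a few thousand and checking relationships one pair at a time stops being a realistic search strategy.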

On the other hand, when you put high dimension data scientists on a low dimension problem, you will either see them missing out on associations that a human could easily spot but a machine might struggle to find, or they might unnecessarily “reduce the problem to a known problem” by generating and importing large amounts of data in order to turn it into a high dimension problem!

PS: I wanted to tweet this today but forgot. Basically, you use logistic regression when you think the signal is an “or” of conditions on some of the underlying variables. On the other hand, if you think the signal is more likely to be an “and” condition of certain conditions, then you should use decision trees!
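This rule of thumb is easy to put to the test. The sketch below is one possible check, not a definitive one: the two conditions (each predictor being positive), the sample size and the depth limit on the tree are all assumptions. It fits both models to an “or” signal and to an “and” signal and records the in-sample accuracy of each.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(5)
points = rng.standard_normal((2000, 2))

# Two hypothetical underlying conditions, one on each variable.
first_condition = points[:, 0] > 0
second_condition = points[:, 1] > 0

targets = {
    "or": first_condition | second_condition,
    "and": first_condition & second_condition,
}

scores = {}
for signal, target in targets.items():
    models = [
        ("logistic", LogisticRegression()),
        # Depth 2 allows one cut per condition (an assumed setting).
        ("tree", DecisionTreeClassifier(max_depth=2, random_state=0)),
    ]
    for model_name, model in models:
        model.fit(points, target)
        scores[(signal, model_name)] = model.score(points, target)

for (signal, model_name), accuracy in scores.items():
    print(f"{signal:3s} signal, {model_name:8s}: in-sample accuracy {accuracy:.3f}")
```

Running something like this on conditions that resemble your own problem is a cheap sanity check before committing to either technique.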