High dimension and low dimension data science

I’ve observed that there are two broad approaches that people take to getting information out of data. One approach is to simply throw a kitchen sink full of analytical techniques at the data. Without really trying to understand what the data looks like, and what the relationships may be, the analyst simply uses one method after another to try and get insight from the data. Along the way, a “model” will get built.

The other approach (which I’m partial to) involves understanding each variable, and the relationships between variables, as a first step to getting insight from the data. Here, too, a model might get built, but it will be conditional on the analyst’s view, formed after looking at the data, of what kind of model might suit it.

Considering that both these approaches are used by large numbers of analysts, it is highly likely that both are legitimate. Then what explains the fact that some analysts use one approach, and others use another? Having thought about it for a long time, I have a hypothesis – it depends on the kind of data being analysed. More precisely, it has to do with the dimensionality of the data.

The first approach (which one might classify as “machine learning”) works well when the data is of high dimensions – where the number of predictors that can be used for prediction is really large, of the order of thousands or larger. For example, even a seemingly low-resolution 32 by 32 pixel image, looked at as a data point, has 1024 dimensions (the colour of the image at each of the 1024 pixels is a different dimension).
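To make the pixel arithmetic concrete, here is a minimal sketch that flattens a 32 by 32 image into a single data point. It assumes one number per pixel (a greyscale value from 0 to 255); the random image and the `flatten` helper are made up purely for illustration.

```python
import random

WIDTH, HEIGHT = 32, 32

def flatten(image):
    """Turn a HEIGHT x WIDTH grid of pixel values into one flat feature vector."""
    return [pixel for row in image for pixel in row]

# A made-up image: one greyscale value (0-255) per pixel.
rng = random.Random(0)
image = [[rng.randint(0, 255) for _ in range(WIDTH)] for _ in range(HEIGHT)]

features = flatten(image)
print(len(features))  # number of dimensions of this single data point
```

If each pixel carried three colour channels instead of one value, the same image would have 3 × 1024 = 3072 dimensions.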

Moreover, in such situations, it is likely that the signal in the data doesn’t come from one, two, or even a handful of predictors. In high dimension data science, the signal usually comes from a complex interplay of data along various dimensions. And this kind of search is not something humans are fit for – it is best that the machines are left to “learn” the model by themselves, and so you get machine learning.

On the other hand, when the dimensionality of the dataset is low, it is possible (and “easy”) for an analyst to look at the interplay of factors in detail, and understand the data before going on to build the model. Doing so can help the analyst identify patterns in the data that may not be that apparent to a machine, and it is also likely that in such datasets, the signal lies with data along a small number of dimensions, where relatively simple manipulation will suffice. The low dimensionality also means that complex machine learning techniques are unlikely to contribute much in such cases.

As you might expect, from an organisational perspective, the solution is quite simple – to deploy high-dimension data scientists on high-dimension problems, and likewise with low-dimension data scientists. Since this distinction between high-dimension and low-dimension data scientists isn’t very well known, it’s quite possible that the scientists might be made to work on a problem of dimensionality that is outside of their comfort zone.

So when you have low-dimension data scientists faced with a large number of dimensions of data, you will see them use brute force to try and find signals in bivariate relationships in the data – an approach that will never work since the signal lies in a more complex interplay of dimensions.
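A toy case shows why looking only at bivariate relationships can come up empty. The sketch below is my own illustration, not real data: the outcome `y` is set to 1 exactly when two binary predictors disagree (an XOR), and the `pearson` helper is a plain hand-written correlation.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# A balanced made-up table: every combination of two binary predictors, 25 times.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25
x1 = [a for a, _ in rows]
x2 = [b for _, b in rows]
y = [a ^ b for a, b in rows]      # outcome depends only on the interplay (XOR)
joint = [a * b for a, b in rows]  # a feature built from both predictors together

# One predictor at a time: both correlations are zero on this balanced table.
corr_x1 = pearson(x1, y)
corr_x2 = pearson(x2, y)
# A feature that combines the two predictors does correlate with the outcome.
corr_joint = pearson(joint, y)

print(corr_x1, corr_x2, corr_joint)
```

Here `y` is completely determined by `x1` and `x2`, yet neither bivariate check shows anything. With thousands of predictors rather than two, hunting for the right combinations by hand quickly becomes impractical.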

On the other hand, when you put high-dimension data scientists on a low-dimension problem, you will either see them missing out on associations that a human could easily find but a machine might struggle to detect, or they might unnecessarily “reduce the problem to a known problem” by generating and importing large amounts of data in order to turn it into a high-dimension problem!

PS: I wanted to tweet this today but forgot. Basically, you use logistic regression when you think the signal is an “or” of conditions on some of the underlying variables. On the other hand, if you think the signal is more likely to be an “and” of certain conditions, then you should use decision trees!
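A rough sketch of the intuition behind this rule of thumb: a logistic model adds up evidence from each variable, so any one condition can push the score over the line, while a root-to-leaf path in a decision tree only fires when every condition on the path holds. The weights, bias and thresholds below are hand-picked for illustration, not fitted to any data, and different values would behave differently.

```python
import math

def logistic_predict(x, weights, bias):
    """Weighted sum of the inputs squashed to (0, 1), then thresholded at 0.5."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    probability = 1.0 / (1.0 + math.exp(-z))
    return int(probability > 0.5)

def tree_predict(x):
    """A two-level tree: the positive leaf is reached only if both splits pass."""
    if x[0] > 0.5:
        if x[1] > 0.5:
            return 1
    return 0

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Hypothetical hand-picked parameters: either variable alone tips the score.
or_like = [logistic_predict(x, weights=(4.0, 4.0), bias=-2.0) for x in inputs]
and_like = [tree_predict(x) for x in inputs]

print(or_like)   # fires when at least one condition holds
print(and_like)  # fires only when both conditions hold
```

This only shows the shape of each model, not what a fitted model will learn on real data.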


Dreaming on about machine learning

I don’t know if I’ve written about this before (that might explain how I crossed 2000 blogposts last year – multiple posts about the same thing), but anyway – I’m writing this listening to Aerosmith’s Dream On.

I don’t recall when I first heard the song, but I somehow decided that it sounded like Led Zeppelin. It was before 2006, so I had no access to services such as Shazam to search effectively. So for a long time I continued to believe it was by Led Zep, and kept going through their archives to locate the song.

And then in 2006, Pandora happened. It became my full-time listening at work (bless those offshored offices with fast internet and US proxies). I would seed stations with songs I liked (back then there was no option to directly play songs you liked – you could only seed stations). I discovered plenty of awesome music that way.

And then one day I had put on a Led Zeppelin station and started work. The first song was by Led Zeppelin itself. And then came Dream On. And I figured it was a song by Aerosmith. While I chided myself for not having identified the band correctly, I was happy that I hadn’t been that wrong – given that Pandora uses machine learning on song patterns to identify similar songs, that Dream On had appeared in a LedZep playlist meant that I hadn’t been too far off identifying it with that band.

Ten years on, I’m not sure why I thought Dream On was by Led Zeppelin – I don’t see any similarities any more. But maybe the algorithms know better!