data science – Page 3 – Pertinent Observations

Astrology and Data Science

The discussion goes back some 6 years, when I’d first started setting up my data and management consultancy practice. Since I’d freshly quit my job to set up the said practice, I had plenty of time on my hands, and the wife suggested that I spend some of that time learning astrology.

Considering that I’ve never been remotely religious or superstitious, I found this suggestion preposterous (I had a funny upbringing in the matter of religion – my mother was insanely religious (including following a certain Baba), and my father was insanely rationalist, and I kept getting pulled in both directions).

Now, the wife has some (indirect) background in astrology. One of her aunts is an astrologer, and specialises in something called “prashNa shaastra“, where the prediction is made based on the time at which the client asks the astrologer a question. My wife believes this has resulted in largely correct predictions (though I suspect a strong dose of confirmation bias there), and (very strangely to me) seems to believe in the stuff.

“What’s the use of studying astrology if I don’t believe in it one bit”, I asked. “Astrology is very mathematical, and you are very good at mathematics. So you’ll enjoy it a lot”, she countered, sidestepping the question.

We went off into a long discussion on the origins of astrology, and how it resulted in early developments in astronomy (necessary in order to precisely determine the position of planets), and so on. The discussion got involved, and involved many digressions, as discussions of this sort might entail. And as you might expect with such discussions, my wife threw a curveball, “You know, you say you’re building a business based on data analysis. Isn’t data analysis just like astrology?”

I was stumped (ok I know I’m mixing metaphors here), and that had ended the discussion then.

Until I decided to bring it up recently. As it turns out, once again (after a brief hiatus when I decided I’ll do a job) I’m in process of setting up a data and management consulting business. The difference is this time I’m in London, and that “data science” is a thing (it wasn’t in 2011). And over the last year or so I’ve been kinda disappointed to see what goes on in the name of “data science” around me.

This XKCD cartoon (which I’ve shared here several times) encapsulates it very well. People literally “pour data into a machine learning system” and then “stir the pile” hoping for the results.

In the process of applying fairly complex “machine learning” algorithms, I’ve seen people not really bother about whether the analysis makes intuitive sense, or if there is “physical meaning” in what the analysis says, or if the correlations actually determine causation. It’s blind application of “run the data through a bunch of scikit learn models and accept the output”.

And this is exactly how astrology works. There are a bunch of predictor variables (position of different “planets” in various parts of the “sky”). There is the observed variable (whether some disaster happened or not, basically), which is nicely in binary format. And then some of our ancients did some data analysis on this, trying to identify combinations of predictors that predicted the output (unfortunately they didn’t have the power of statistics or computers, so in that sense the models were limited). And then they simply accepted the outputs, without challenging why it makes sense that the position of Jupiter at the time of wedding affects how your marriage will go.

So I brought up the topic of astrology and data science again recently, saying “OK after careful analysis I admit that astrology is the oldest form of data science”. “That’s not what I said”, the wife countered. “I said that data science is new age astrology, and not the other way round”.

It’s hard to argue with that!

Profit and politics

Earlier today I came across this article about data scientists on LinkedIn that I agreed with so much that I started wondering if it was simply a case of confirmation bias.

A few sentences (possibly taken out of context) from there that I agree with:

Many large companies have fallen into the trap that you need a PhD to do data science, you don’t.
There are some smart people who know a lot about a very narrow field, but data science is a very broad discipline. When these PhD’s are put in charge, they quickly find they are out of their league.
Often companies put a strong technical person in charge when they really need a strong business person in charge.
I always found the academic world more political than the corporate world and when your drive is profits and customer satisfaction, that academic mindset is more of a liability than an asset.

Back to the topic, which is the last of these sentences. This is something I’ve intended to write for 5-6 years now, since the time I started off as an independent management consultant.

During the early days I took on assignments from both for-profit and not-for-profit organisations, and soon it was very clear that I enjoyed working with for-profit organisations a lot more. It wasn’t about money – I was fairly careful in my negotiations to never underprice myself. It was more to do with processes, and interactions.

The thing in for-profit companies is that objectives are clear. While not everyone in the company has an incentive to increase the bottom-line, it is not hard to understand what they want based on what they do.

For example, in most cases a sales manager optimises for maximum sales. Financial controllers want to keep a check on costs. And so on. So as part of a consulting assignment, it’s rather easy to know who wants what, and how you should pitch your solution to different people in order to get buy-in.

With a not-for-profit it’s not that clear. While each person may have their own metrics and objectives, because the company is not for profit, these objectives and metrics need not be everything they’re optimising for.

Moreover, in the not for profit world, the lack of money or profit as an objective means you cannot differentiate yourself with efficiency or quantity. Take the example of an organisation which, for whatever reason, gets to advice a ministry on a particular subject, and does so without a fee or only for a nominal fee.

How can a competitor who possibly has a better solution to the same problem “displace” the original organisation? In the business world, this can be done by showing superior metrics and efficiency and offering to do the job at a lower cost and stuff like that. In the not-for-profit setup, you can’t differentiate on things like cost or efficiency, so the only thing you can do is to somehow provide your services in parallel and hope that the client gets it.

And then there is access. If you’re a not-for-profit consultant who has a juicy project, it is in your interest to become a gatekeeper and prevent other potential consultants from getting the same kind of access you have – for you never know if someone else who might get access through you might end up elbowing you out.

The (missing) Desk Quants of Main Street

A long time ago, I’d written about my experience as a Quant at an investment bank, and about how banks like mine were sitting on a pile of risk that could blow up any time soon.

There were two problems as I had documented then. Firstly, most quants I interacted with seemed to be solving maths problems rather than finance problems, not bothering if their models would stand the test of markets. Secondly, there was an element of groupthink, as quant teams were largely homogeneous and it was hard to progress while holding contrarian views.

Six years on, there has been no blowup, and in some sense banks are actually doing well (I mean, they’ve declined compared to the time just before the 2008 financial crisis but haven’t done that badly). There have been no real quant disasters (yes I know the Gaussian Copula gained infamy during the 2008 crisis, but I’m talking about a period after that crisis).

There can be many explanations regarding how banks have not had any quant blow-ups despite quants solving for math problems and all thinking alike, but the one I’m partial to is the presence of a “middle layer”.

Most of the quants I interacted with were “core” in the sense that they were not attached to any sales or trading desks. Banks also typically had a large cadre of “desk quants” who are directly associated with trading teams, and who build models and help with day-to-day risk management, pricing, etc.

Since these desk quants work closely with the business, they turn out to be much more pragmatic than the core quants – they have a good understanding of the market and use the models more as guiding principles than as rules. On the other hand, they bring the benefits of quantitative models (and work of the core quants) into day-to-day business.

Back during the financial crisis, I’d jokingly predicted that other industries should hire quants who were now surplus to Wall Street. Around the same time, DJ Patil et al came up with the concept of the “data scientist” and called it the “sexiest job of the 21st century”.

And so other industries started getting their own share of quants, or “data scientists” as they were now called. Nowadays its fashionable even for small companies for whom data is not critical for business to have a data science team. Being in this profession now (I loathe calling myself a “data scientist” – prefer to say “quant” or “analytics”), I’ve come across quite a few of those.

The problem I see with “data science” on “Main Street” (this phrase gained currency during the financial crisis as the opposite of Wall Street, in that it referred to “normal” businesses) is that it lacks the cadre of desk quants. Most data scientists are highly technical people who don’t necessarily have an understanding of the business they operate in.

Thanks to that, what I’ve noticed is that in most cases there is a chasm between the data scientists and the business, since they are unable to talk in a common language. As I’m prone to saying, this can go two ways – the business guys can either assume that the data science guys are geniuses and take their word for the gospel, or the business guys can totally disregard the data scientists as people who do some esoteric math and don’t really understand the world. In either case, value added is suboptimal.

It is not hard to understand why “Main Street” doesn’t have a cadre of desk quants – it’s because of the way the data science industry has evolved. Quant at investment banks has evolved over a long period of time – the Black-Scholes equation was proposed in the early 1970s. So the quants were first recruited to directly work with the traders, and core quants (at the banks that have them) were a later addition when banks realised that some quant functions could be centralised.

On the other hand, the whole “data science” growth has been rather sudden. The volume of data, cheap incrementally available cloud storage, easy processing and the popularity of the phrase “data science” have all increased well-at-a-faster rate in the last decade or so, and so companies have scrambled to set up data teams. There has simply been no time to train people who get both the business and data – and the data scientists exist like addendums that are either worshipped or ignored.

Maths, machine learning, brute force and elegance

Back when I was at the International Maths Olympiad Training Camp in Mumbai in 1999, the biggest insult one could hurl at a peer was to describe the latter’s solution to a problem as being a “brute force solution”. Brute force solutions, which were often ungainly, laboured and unintuitive were supposed to be the last resort, to be used only if one were thoroughly unable to implement an “elegant solution” to the problem.

Mathematicians love and value elegance. While they might be comfortable with esoteric formulae and the Greek alphabet, they are always on the lookout for solutions that are, at least to the trained eye, intuitive to perceive and understand. Among other things, it is the belief that it is much easier to get an intuitive understanding for an elegant solution.

When all the parts of the solution seem to fit so well into each other, with no loose ends, it is far easier to accept the solution as being correct (even if you don’t understand it fully). Brute force solutions, on the other hand, inevitably leave loose ends and appreciating them can be a fairly massive task, even to trained mathematicians.

In the conventional view, though, non-mathematicians don’t have much fondness for elegance. A solution is a solution, and a problem solved is a problem solved.

With the coming of big data and increased computational power, however, the tables are getting turned. In this case, the more mathematical people, who are more likely to appreciate “machine learning” algorithms recommend “leaving it to the system” – to unleash the brute force of computational power at the problem so that the “best model” can be found, and later implemented.

And in this case, it is the “half-blood mathematicians” like me, who are aware of complex algorithms but are unsure of letting the system take over stuff end-to-end, who bat for elegance – to look at data, massage it, analyse it and then find that one simple method or transformation that can throw immense light on the problem, effectively solving it!

The world moves in strange ways.

High dimension and low dimension data science

I’ve observed that there are two broad approaches that people take to getting information out of data. One approach is to simply throw a kitchen sink full of analytical techniques at the data. Without really trying to understand what the data looks like, and what the relationships may be, the analyst simply uses one method after another to try and get insight from the data. Along the way, a “model” will get built.

The other approach (which I’m partial to) involves understanding each variable, and relationship between variables as a first step to getting insight from the data. Here, too, a model might get built, but it will be conditional on the analyst’s view on what kind of a model might suit the data after looking at the data.

Considering that both these approaches are used by large numbers of analysts, it is highly likely that both are legitimate. Then what explains the fact that some analysts use one approach, and others use another? Having thought about it for a long time, I have a hypothesis – it depends on the kind of data being analysed. More precisely, it has to do with the dimensionality of the data.

The first approach (which one might classify as “machine learning”) works well when the data is of high dimensions – where the number of predictors that can be used for predictors is really large, of the order of thousands or larger. For example, even a seemingly low-resolution 32 by 32 pixel image, looked at as a data point, has 1024 dimensions (colour of the image at each of the 1024 pixels is a different dimension).

Moreover, in such situations, it is likely that the signal in the data doesn’t come from one, or, two, or a handful of predictors. In high dimension data science, the signal usually comes from complex interplay of data along various dimensions. And this kind of search is not something humans are fit for – it is best that the machines are left to “learn” the model by themselves, and so you get machine learning.

On the other hand, when the dimensionality of the dataset is low, it is possible (and “easy”) for an analyst to look at the interplay of factors in detail, and understand the data before going on to build the model. Doing so can help the analyst identify patterns in the data that may not be that apparent to a machine, and it is also likely that in such datasets, the signal lies with data along a small number of dimensions, where relatively simple manipulation will suffice. The low dimensionality also means that complex machine learning techniques are unlikely to contribute much in such cases.

As you might expect, from an organisational perspective, the solution is quite simple – to deploy high-dimension data scientists on high-dimension problems, and likewise with low-dimension data scientists. Since this distinction between high-dimension and low-dimension data scientists isn’t very well known, it’s quite possible that the scientists might be made to work on a problem of dimensionality that is outside of their comfort zone.

So when you have low dimensional data scientists faced with a large number of dimensions of data, you will see them use brute force to try and find signals in bivariate relationships in the data – an approach that will never work since the signal lies in a more complex interplay of dimensions.

On the other hand, when you put high dimension data scientists on a low dimension problem, you will either see them missing out on associations that a human could easily find but a machine might find hard to find, or they might unnecessarily “reduce the problem to a known problem” by generating and importing large amounts of data in order to turn it into a high dimension problem!

PS: I wanted to tweet this today but forgot. Basically, you use logistic regression when you think the signal is an “or” of conditions on some of the underlying variables. On the other hand, if you think the signal is more likely to be an “and” condition of certain conditions, then you should use decision trees!

Data Science is a Creative Profession

About a month or so back I had a long telephonic conversation with this guy who runs an offshored analytics/data science company in Bangalore. Like most other companies that are being built in the field of analytics, this follows the software services model – a large team in an offshored location, providing long-term standardised data science solutions to a client in a different “geography”.

As is usual with conversations like this one, we talked about our respective areas of work and kind of projects we take on, and soon we got to the usual bit in such conversations where we were trying to “find synergies”. Things were going swimmingly when this guy remarked that it was the first time he was coming across a freelancer in this profession. “I’ve heard of freelance designers and writers, but never freelance data scientists or analytics professionals”, he mentioned.

In a separate event I was talking to one old friend about another old friend who has set up a one-man company to do provide what is basically freelance consulting services. We reasoned that the reason this guy had set up a company rather than calling himself a freelancer given the reputation that “freelancers” (irrespective of the work they do) have – if you say you are a freelancer people think of someone smoking pot and working in a coffee shop on a Mac. If you say you are a partner or founder of a company, people imagine someone more corporate.

Now that the digression is out of the way let us get back to my conversation with the guy who runs the offshored shop. During the conversation I didn’t say much, just saying things like “what is wrong with being a freelancer in this profession”. But now that i think more about it, it is simply a function of the profession being a fundamentally creative profession.

For a large number of people, data science is simply about statistics, or “machine learning” or predictive modelling – it is about being given a problem expressed in statistical terms and finding the best possible model and model parameters for it. It is about being given a statistical problem and finding a statistical solution – I’m not saying, of course, that statistical modelling is not a creative profession – there is a fair bit of creativity involved in figuring out what kind of model to model, and picking the right model for the right data. But when you have a large team working on the problem, working effectively like an assembly line (with different people handling different parts of the solution), what you get is effectively an “assembly line solution”.

Coming back, let us look at this “a day in the life” post I wrote about a year back about a particular day in office for me. I’ve detailed in that the various kinds of problems I had to solve that day – hidden markov models and bayesian probability to writing code using dynamic programming and implementing the code in R, and then translating the solution back to the business context. Notice that when I started off working on the problem it was not known what domain the problem belonged in – it took some poking and prodding around in order to figure out the nature of the problem and the first step in solution.

And then on, it was one step leading to another, and there are two important facts to consider about each step – firstly, at each step, it wasn’t clear as to what the best class of technique was to get beyond the step – it was about exploration in order to figure out the best class of technique. Next, at no point in time was it known what the next step was going to be until the current step was solved. You can see that it is hard to do it in an assembly line fashion!

Now, you can talk about it being like a game of chess where you aren’t sure what the opponent will do, but then in chess the opponent is a rational human being, while here the “opponent” is basically the data and the patterns it shows, and there is no way to know until you try something as to how the data will react to that. So it is impossible to list out all steps beforehand and solve it – solution is an exploratory process.

And since solving a “data science problem” (as I define it, of course) is an exploratory, and thus creative, process, it is important to work in an atmosphere that fosters creativity and “thinking without thinking” (basically keep a problem in the back of your mind and then take your mind off it, and distract yourself to solve the problem). This is best done away from a traditional corporate environment – where you have to attend meetings and be liable to be disturbed by colleagues at all times, and this is why a freelance model is actually ideal! A small partnership also works – while you might find it hard to “assembly line” the problem, having someone to bounce thoughts and ideas with can have a positive impact to the creative process. Anything more like a corporate structure and you are removing the conditions necessary to foster creativity, and are in such situations more likely to come up with cookie-cutter solutions.

So unless your business model deals with doing repeatable and continuous analytical work for a client, you are better off organising yourselves in an environment that fosters creativity and not a traditional office kind of structure if you want to solve problems using data science. Then again, your mileage might vary!

Should you have an analytics team?

In an earlier post, I had talked about the importance of business people knowing numbers and numbers people knowing business, and had put in a small advertisement for my consulting services by mentioning that I know both business and numbers and work at their cusp. In this post, I take that further and analyze if it makes sense to have a dedicated analytics team.

Following the data boom, most companies have decided (rightly) that they need to do something to take advantage of all the data that they have and have created dedicated analytics teams. These teams, normally staffed with people from a quantitative or statistical background, with perhaps a few MBAs, is in charge of taking care of all the data the company has along with doing some rudimentary analysis. The question is if having such dedicated teams is effective or if it is better to have numbers-enabled people across the firm.

Having an analytics team makes sense from the point of view of economies of scale. People who are conversant with numbers are hard to come by, and when you find some, it makes sense to put them together and get them to work exclusively on numerical problems. That also ensures collaboration and knowledge sharing and that can have positive externalities.

Then, there is the data aspect. Anyone doing business analytics within a firm needs access to data from all over the firm, and if the firm doesn’t have a centralized data warehouse which houses all its data, one task of each analytics person would be to get together the data that they need for their analysis. Here again, the economies of scale of having an integrated analytics team work. The job of putting together data from multiple parts of the firm is not solved multiple times, and thus the analysts can spend more time on analyzing rather than collecting data.

So far so good. However, writing a while back I had explained that investment banks’ policies of having exclusive quant teams have doomed them to long-term failure. My contention there (including an insider view) was that an exclusive quant team whose only job is to model and which doesn’t have a view of the market can quickly get insular, and can lead to groupthink. People are more likely to solve for problems as defined by their models rather than problems posed by the market. This, I had mentioned can soon lead to a disconnect between the bank’s models and the markets, and ultimately lead to trading losses.

Extending that argument, it works the same way with non-banking firms as well. When you put together a group of numbers people and call them the analytics group, and only give them the job of building models rather than looking at actual business issues, they are likely to get similarly insular and opaque. While initially they might do well, soon they start getting disconnected from the actual business the firm is doing, and soon fall in love with their models. Soon, like the quants at big investment banks, they too will start solving for their models rather than for the actual business, and that prevents the rest of the firm from getting the best out of them.

Then there is the jargon. You say “I fitted a multinomial logistic regression and it gave me a p-value of 0.05 so this model is correct”, the business manager without much clue of numbers can be bulldozed into submission. By talking a language which most of the firm understands you are obscuring yourself, which leads to two responses from the rest. Either they deem the analytics team to be incapable (since they fail to talk the language of business, in which case the purpose of existence of the analytics team may be lost), or they assume the analytics team to be fundamentally superior (thanks to the obscurity in the language), in which case there is the risk of incorrect and possibly inappropriate models being adopted.

I can think of several solutions for this – but irrespective of what solution you ultimately adopt – whether you go completely centralized or completely distributed or a hybrid like above – the key step in getting the best out of your analytics is to have your senior and senior-middle management team conversant with numbers. By that I don’t mean that they all go for a course in statistics. What I mean is that your middle and senior management should know how to solve problems using numbers. When they see data, they should have the ability to ask the right kind of questions. Irrespective of how the analytics team is placed, as long as you ask them the right kind of questions, you are likely to benefit from their work (assuming basic levels of competence of course). This way, they can remain conversant with the analytics people, and a middle ground can be established so that insights from numbers can actually flow into business.

So here is the plug for this post – shortly I’ll be launching short (1-day) workshops for middle and senior level managers in analytics. Keep watching this space

Project Thirty – Closure

Today is the last day of my twenties. Which means Project Thirty has come to an end. I had a long list of things to do, and as the more perceptive of you would have expected from me, most of them are undone. Nevertheless, it has been a mostly positive year, and I’m glad I gave myself the year off in an attempt to find what I want to do.

The biggest positive of the last one year was that my mental illnesses (anxiety, depression, ADHD) got diagnosed and started getting treated. Yes I’m on drugs, and face severe withdrawal symptoms if I don’t take my antidepressant for a few days, but the difference these drugs have made to my life is astounding. I feel young again. I feel intelligent again. I have more purpose in life, and am back at the cocky confidence levels I last saw in 2005. I suddenly feel there’s so much for me to do, and for the first time ever, have started enjoying what I’m doing for money.

Which brings me to professional life. I decided to give myself a year to become a freelancer. I must admit I got one lucky break (one long-term reader of this blog was looking for a data science consultant for his company and I grabbed the opportunity), but I grabbed it. My improved mental state meant that I was motivated enough to do a good job of the pilot project I did for that company, and I have managed to extract what I think is a reasonable compensation for my consulting services.

There are other exciting opportunities on the horizon on the professional front, too. I’ve started teaching at Takshashila and am quite liking it (I hope my students are, too). There is so much opportunity staring at me right now that the biggest problem for me is one of prioritization rather than looking for opportunity.

There has been both progression and regression on the “extra curricular activities” front. Thanks to demands of my consulting assignment, I haven’t been getting time to practice the violin and abruptly discontinued classes two months ago. I did one awesome and rejuvenating bike trip across Rajasthan back in February but wasn’t able to follow that up even with weekend trips. I wanted to start on adventure sports but that remained a non-starter. I started preparing for a half-marathon and gave up in a month. I took a sports club membership, tried to teach myself swimming again, but have been irregular.

Personal life again has been mixed. Increasing excitement about work means less time for the family, and have been finding it hard to balance the time requirements. I seem to be putting on weight again, and now look closer to the monstrosity I was four years ago rather than the fit guy I was two years ago at the time of my wedding. I blame my expanding waistline and neckline on my travel, but that is not an excuse and I need to find an exciting way to get fit soon.

For perhaps the first time in several years my car didn’t take a knock that year, but I had two motorcycling accidents (one major and one minor) this year. The former led to the first ever broken bone in my body (the fifth metacarpal) thanks to which I don’t have a prominent fourth knuckle on my right hand. The latter led to major damages to my laptop.

My “studs and fighters” book still remains unwritten, and not a word has been added to its manuscript in the last one year. I was hoping to capstone my Project Thirty by organizing the first ever “NED Talks” but I seem to have bitten off much more than I can chew in terms of work, so that has again been postponed.

So let me take this opportunity to define my Project Thirty One. I want at least two published books by the end of the year. I want to do at least one major motorcycling trip. I need to find partners/employees and expand my consulting business. I want to travel a lot more – at least five weekend trips over the next 12 months. I want to become fit, to the size I was at the time of my wedding. Hopefully I can get weaned off anti-depressants. And I hope to resume music lessons, and start jamming. Ok let me not promise myself too much.

And I have a five-year plan too. By the time I’m thirty five, I want to have written a book on the economic history of India. Ambitious, I know. Especially for an NED Fellow like me.

Data Science and Software Engineering

I’m a data scientist. I’m good with numbers, and handling large and medium sized data sets (that doesn’t mean I’m bad at handling small data sets, of course). The work-related thing that gives me most kicks is to take a bunch of data and through a process of simple analysis, extract information out of it. To twist and turn the data, or to use management jargon “slice and dice”, and see things that aren’t visible to too many people. To formulate hypotheses, and use data to prove or disprove them. To represent data in simple but intuitive formats (i.e. graphs) so as to convey the information I want to convey.

I can count my last three jobs (including my current one) as being results of my quest to become better at data science and modeling. Unfortunately, none of these jobs have turned out particularly well (this includes my current one). The problem has been that in all these jobs, data science has been tightly coupled with software engineering, and I suck at software engineering.

Let me stop for a moment and tell you that I don’t mind programming. In fact, I love programming. I love writing code that makes my job easier, and automates things, and gives me data in formats that I desire. But I hate software engineering. Of writing code within a particular system, or framework. Or adhering to standards that someone else sets for “good code”. Of following processes and making my code usable by some dumbfuck somewhere else who wouldn’t get it if I wrote it the way I wanted. As I’d mentioned earlier, I like coding for myself. I don’t like coding for someone else. And so I suck at software engineering.

Now I wonder if it’s possible at all to decouple data science from software engineering. My instinct tells me that it should be possible. That I need not write production-level code in order to turn my data-based insights into commercially viable form. Unfortunately, in my search around the corporatosphere thus far, I haven’t been able to find something of the sort.

Which makes me wonder if I should create my own niche, rather than hoping for someone else to create it for me.