So after much deliberation and procrastination, I’ve finally started a newsletter. I call it “the art of data science” and the title should be self-explanatory. It’s pure unbridled opinion (the kind of which usually goes on this blog), except that I only write about one topic there.

I intend to have three sections and then a “chart of the edition” (note how cleverly I’ve named this section to avoid giving much away on the frequency of the newsletter!). This edition, though, I ended up putting in too much harikathe, so I restricted it to two sections before the chart.

I intend to talk a bit each edition about some philosophical part of dealing with data (this section got a miss this time), a bit on data analysis methods (I went a bit meta on this this time) and a little bit on programming languages (which I used for bitching a bit).

And that I plan to put a “chart of the edition” means I need to read newspapers a lot more, since you are much more likely to find gems (in either direction) there than elsewhere. For the first edition, I picked off a good graph I’d seen on Twitter, and it’s about Hull City!

Anyway, enough of this meta-harikathe. You can read the first edition of the newsletter here. In case you want to get it in your inbox each week/fortnight/whenever I decide to write it, then subscribe here!

And put feedback (by email, not comments here) on what you think of the newsletter!

High dimension and low dimension data science

I’ve observed that there are two broad approaches that people take to getting information out of data. One approach is to simply throw a kitchen sink full of analytical techniques at the data. Without really trying to understand what the data looks like, and what the relationships may be, the analyst simply uses one method after another to try and get insight from the data. Along the way, a “model” will get built.

The other approach (which I’m partial to) involves understanding each variable, and relationship between variables as a first step to getting insight from the data. Here, too, a model might get built, but it will be conditional on the analyst’s view on what kind of a model might suit the data after looking at the data.

Considering that both these approaches are used by large numbers of analysts, it is highly likely that both are legitimate. Then what explains the fact that some analysts use one approach, and others use another? Having thought about it for a long time, I have a hypothesis – it depends on the kind of data being analysed. More precisely, it has to do with the dimensionality of the data.

The first approach (which one might classify as “machine learning”) works well when the data is of high dimensions – where the number of predictors that can be used for prediction is really large, of the order of thousands or more. For example, even a seemingly low-resolution 32 by 32 pixel image, looked at as a data point, has 1024 dimensions (the colour of the image at each of the 1024 pixels is a separate dimension).
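To make that dimension count concrete, here is a quick sketch (using numpy purely for illustration – the post itself doesn’t mention any tooling):

```python
import numpy as np

# A 32-by-32 single-channel image, viewed as a single data point
image = np.zeros((32, 32))

# Flattening it gives a vector with one dimension per pixel
point = image.flatten()
print(len(point))  # 1024
```

With three colour channels the same image would be a 3072-dimensional point, which is why even tiny images land you firmly in high-dimension territory.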

Moreover, in such situations, it is likely that the signal in the data doesn’t come from one, two, or a handful of predictors. In high dimension data science, the signal usually comes from a complex interplay of data along various dimensions. And this kind of search is not something humans are fit for – it is best that the machines are left to “learn” the model by themselves, and so you get machine learning.

On the other hand, when the dimensionality of the dataset is low, it is possible (and “easy”) for an analyst to look at the interplay of factors in detail, and understand the data before going on to build the model. Doing so can help the analyst identify patterns in the data that may not be that apparent to a machine, and it is also likely that in such datasets the signal lies with data along a small number of dimensions, where relatively simple manipulation will suffice. The low dimensionality also means that complex machine learning techniques are unlikely to contribute much in such cases.

As you might expect, from an organisational perspective, the solution is quite simple – to deploy high-dimension data scientists on high-dimension problems, and likewise with low-dimension data scientists. Since this distinction between high-dimension and low-dimension data scientists isn’t very well known, it’s quite possible that the scientists might be made to work on a problem of dimensionality that is outside of their comfort zone.

So when you have low-dimension data scientists faced with data of a large number of dimensions, you will see them use brute force to try and find signals in bivariate relationships in the data – an approach that will never work, since the signal lies in a more complex interplay of dimensions.

On the other hand, when you put high-dimension data scientists on a low-dimension problem, you will either see them missing out on associations that a human could easily spot but a machine might miss, or they might unnecessarily “reduce the problem to a known problem” by generating and importing large amounts of data in order to turn it into a high dimension problem!

PS: I wanted to tweet this today but forgot. Basically, you use logistic regression when you think the signal is an “or” of conditions on some of the underlying variables. On the other hand, if you think the signal is more likely to be an “and” of certain conditions, then you should use decision trees!
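A quick synthetic illustration of that PS (the data and thresholds here are entirely made up, and sklearn is used purely for convenience): a depth-2 decision tree can represent an “and” of two threshold conditions exactly, while a single linear boundary, which is all logistic regression gives you, can only approximate the corner-shaped region.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(1000, 2))
# An "and" signal: positive only when BOTH conditions hold
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.5)).astype(int)

lr = LogisticRegression().fit(X, y)          # one linear boundary
dt = DecisionTreeClassifier(max_depth=2).fit(X, y)  # two axis-aligned splits

print("logistic:", lr.score(X, y))  # approximates the corner, imperfectly
print("tree:    ", dt.score(X, y))  # two splits capture the "and" exactly
```

Flip the labels to an “or” of the two conditions and the tree still needs its splits, but the region becomes much friendlier to a single linear boundary.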


When a two-by-two ruins a scatterplot

The BBC has some very good analysis of the Brexit vote (how long back was that?), using voting data at the local authority level, and correlating it with factors such as ethnicity and educational attainment.

In terms of educational attainment, there is a really nice chart that shows the proportion of voters who voted to leave against the proportion of the ward’s population with at least a bachelor’s degree. One look at the graph tells you that the correlation is rather strong:

Then there is the two-by-two that is superimposed on this – with regions being marked off in pink and grey. The idea of the two-by-two must have been to illustrate the correlation – to show that education is negatively correlated with the “leave” vote.

But what do we see here? A majority of the points lie in the bottom left pink region, suggesting that wards with a lower proportion of graduates were less likely to vote leave. And this is entirely the wrong message for the graph to send.

The two-by-two would have been useful had the points in the graph been neatly divided into clusters that could be arranged in a grid. Here, though, what the scatter plot shows is a nice negatively correlated linear relationship. And by putting those pink and grey boxes, the illustration is taking attention away from that relationship.

Instead, I’d simply put the scatter plot as it is, and maybe add the line of best fit, to emphasise the negative correlation. If I want to be extra geeky, I might also write down the R^2 next to the line, to show the extent of correlation!
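For what it’s worth, the fit line and the R^2 are only a couple of lines of numpy (the data below is synthetic, standing in for the BBC’s ward-level figures):

```python
import numpy as np

# Synthetic stand-in: share of graduates vs share voting leave
rng = np.random.default_rng(42)
graduates = rng.uniform(0.1, 0.6, 200)
leave = 0.8 - 0.9 * graduates + rng.normal(0, 0.05, 200)

# Line of best fit (degree-1 polynomial)
slope, intercept = np.polyfit(graduates, leave, 1)

# R^2: share of variance in `leave` explained by the line
pred = slope * graduates + intercept
r2 = 1 - np.sum((leave - pred) ** 2) / np.sum((leave - leave.mean()) ** 2)

print(f"slope={slope:.2f}, R^2={r2:.2f}")
```

A negative slope and a high R^2 together make the point the two-by-two was trying (and failing) to make.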


Medium stats

So Medium sends me this email:

Congratulations! You are among the top 10% of readers and writers on Medium this year. As a small thank you, we’ve put together some highlights from your 2016.

Now, I hardly use Medium. I’ve maybe written one post there (a long time ago) and read only a little bit (blogs I really like I’ve put on RSS and read on Feedly). So when Medium tells me that I, who consider myself a light user, am “in the top 10%”, they’re really giving away the fact that the quality of usage on their site is pretty bad.

Sometimes it’s bloody easy to see through flattery! People need to be more careful about what the stats they’re putting out really convey!


Quantifying life

During a casual conversation on Monday, the wife remarked that given my interests and my profession (where I mostly try to derive insights from data), she was really surprised that I had never tried using data to optimise my own life.

This is a problem I’ve had in the past – I can look at clients’ data and advise them on how exactly to build their business, but I’m thoroughly incapable of doing similar analysis of my own business. I berate people for not using data and relying too much on “gut”, but “gut” is what I use for most of my own life decisions.

With this contradiction in mind, it made sense for me to start quantifying my life. Except that I didn’t know where to start. The first thing you think of when you want to do something new is to buy new gadgets for it, and I quickly asked the wife to pick up a Fitbit for me on her way back from the US next month. She would have none of it – I should use the tools that I have, she said.

I’ve tried logging stuff and writing diaries in the past but it’s mostly been tedious business (unless I’ve had to write my diary free form, which I’ve quite liked). A couple of days is all that most logs have lasted before I’ve lost interest. I hate making checklists (looking at them psyches me out), I maintain my calendar in my head (thus wasting precious memory space) and had nightmares writing notes in school.

A couple of times when I’ve visited dieticians or running coaches I’ve been asked to make a log of what I’ve been eating, and I’ve never been able to do it for more than one meal – there is too much ambiguity in the data (a “cup of dal” can mean several things) to be entered which makes the data entry process tedious.

This time, however, I’m quite bullish about maintaining the log that the wife has created for me. Helpfully, it’s on Google Docs, so I can access it on the move. More importantly, she has structured the sheet in such a way that there is no fatigue in entry. The number of columns is more than what I would have liked, but having used it for two days so far, I don’t see why I should get tired of it.

The key is the simplicity of the questions, and how little effort is required to fill them in. Most questions are straightforward (“what time did you wake up?” “what time did you have breakfast?” etc.) and have deterministic answers. There are subjective questions (“quality of pre-lunch work”) but the wife has designed them such that I only need to enter a rating (she had put in a 3-point Likert scale, which I changed to a 5-point Likert scale since I found the latter more useful here).

There are no essays. No comments. Very little ambiguity on how I should fill it in. And minimal judgment required.

I might be jumping to conclusions already (it’s been but two days since I started filling it in), but the design of this questionnaire holds important lessons in how to design a survey or questionnaire in order to get credible responses:
1. Keep things simple
2. Reduce subjectivity as much as possible
3. Don’t tax the filler’s mind too much. The less the mental effort required the better.
4. Account for NED. Don’t make the questionnaire too long, else it causes fatigue. My instructions to the wife were that the questionnaire should be small enough to fit in my browser window (when viewed on a computer). This would have limited the questions to 11, but she’s put 14, which is still not too bad.

The current plan is to collect data over the next 45 days after which we will analyse it. I may or may not share the results of the analysis here. But I’ll surely recommend my wife’s skills in designing questionnaires! Maybe she should take a hint from this in terms of her post-MBA career.

Restaurants, deliveries and data

Delivery aggregators are moving customer data away from the retailer, who now has less knowledge about his customer. 

Ever since data collection and analysis became cheap (with cloud-based on-demand web servers and MapReduce), there have been attempts to collect as much data as possible and use it to do better business. I must admit to being part of this racket, too, as I try to convince potential clients to hire me so that I can tell them what to do with their data and how.

And one of the more popular areas where people have been trying to use data is in getting to “know their customer”. This is not a particularly new exercise – supermarkets, for example, have been offering loyalty cards so that they can correlate purchases across visits and get to know you better (as part of a consulting assignment, I once sat with my clients looking at a few supermarket bills. It was incredible how much we humans could infer about the customers by looking at those bills).

The more recent trend (after it became possible to analyse large amounts of data) is to capture “loyalties” across several stores or brands, so that affinities can be tracked across them and customers can be understood better. Given data privacy issues, this has typically been done by third party agents, who then sell back the insights to the companies whose data they collect. An early example of this is Payback, which links activities on your ICICI Bank account with other products (telecom providers, retailers, etc.) to gain superior insights on what you are like.

Nowadays, with cookie farming on the web, this is more common, and you have sites that track your web cookies to figure out correlations between your activities, and thus infer your lifestyle, so that better advertisements can be targeted at you.

In the last two or three years, significant investments have been made by restaurants and retailers to install devices to get to know their customers better. Traditional retailers are being fitted with point-of-sale devices (provision of these devices is a highly fragmented market), and restaurants are trying to introduce loyalty schemes (again a highly fragmented market). This is all an attempt to get to know the customer better. Except that middlemen are ruining it.

I’ve written a fair bit on middleman apps such as Grofers or Swiggy. They are basically delivery apps, which pick up goods for you from a store and deliver them to your place. A useful service, though as I suggest in my posts linked above, probably overvalued. As the share of a restaurant or store’s business goes to such intermediaries, though, there is another threat to the restaurant – lack of customer data.

When Grofers buys my groceries from my nearby store, it is unlikely to tell the store who it is buying for. Similarly when Swiggy buys my food from a restaurant. This means the loyalty schemes of these sellers will go for a toss – of course, not offering the same loyalty programme to delivery companies is a no-brainer. But what the sellers are also missing out on is the customer data that they would have otherwise captured (had they sold directly to the customer).

A good thing about Grofers or Swiggy is that they’ve hit the market at a time when sellers are yet to fully realise the benefits of capturing customer data, so they may be able to capture such data for cheap, and maybe sell it back to their seller clients. Yet, if you are a retailer who is selling to such aggregators and you value your customer data, make sure you get your pound of flesh from these guys.

On Uppi2’s top rating

So it appears that my former neighbour Upendra’s new magnum opus Uppi2 is currently the top rated movie on IMDB, with a rating of 9.7/10.0. The Times of India is so surprised that it has done an entire story about it, which I’ve screenshot here.

The story also mentions that another Kannada movie RangiTaranga (which I’ve reviewed here) is in third spot, with a rating of 9.4 out of 10. This might lead you to wonder why Kannada movies have suddenly turned out to be so good. The answer, however, lies in simple logic.

The first is that both are relatively new movies, and hence their ratings suffer from “small sample bias”. Of course, the sample isn’t that small – Uppi2 has received 1900 votes, which is three times as many as its 1999 prequel Upendra. Yet, it being a new movie, only a subset of the small set of people who have watched it so far would have reviewed it.

The second is selection bias. The people who see a movie in its first week are usually the hardcore fans, and in this case it is hardcore fans of Upendra’s movies. And hardcore fans usually find it hard to have their belief shaken (a version of what I’ve written about online opinions for Mint here), and hence they all give the movie a high rating.

As time goes by, and people who are less hardcore fans of Upendra start watching and reviewing the movie, the ratings are likely to rationalise. Finally, ratings are easy to rig, especially when samples are small. For example, an Upendra fan club might have decided to play up the movie online by voting en masse on IMDB, pushing up its rating. This might explain both why the movie already has 1900 ratings within four days, and why most of them are extremely positive.

The solution for this is for the rating system (IMDB in this case) to give more weightage to “verified ratings” (by people who have rated many movies in the past, for instance), or to discount highly correlated ratings. Right now, the rating algorithm seems pretty naive.
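One simple version of such a fix is Bayesian shrinkage – pull each movie’s average rating towards the site-wide mean, with the pull weakening as votes accumulate. A sketch (the prior constants here are made up for illustration, not IMDB’s actual parameters):

```python
def weighted_rating(avg, votes, prior_mean=6.9, prior_votes=25000):
    """Shrink a movie's average rating towards the site-wide mean.

    With few votes, the prior dominates; with many votes, the
    movie's own average dominates.
    """
    return (votes * avg + prior_votes * prior_mean) / (votes + prior_votes)

# A 9.7 average from just 1900 votes barely moves the needle
print(round(weighted_rating(9.7, 1900), 2))  # 7.1
```

Under a scheme like this, a four-day-old movie with 1900 enthusiastic votes would sit nowhere near the top of the charts until far more people had rated it.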

Coming back to Uppi2, from what I’ve heard from people, the movie is supposed to be really good, though perhaps not 9.7 good. I plan to watch the movie in the next few days and will write a review once I do so.

Meanwhile, read this absolutely brilliant review (in Kannada) written by this guy called “Jogi”.