## Ratings revisited

Sometimes I get a bit narcissistic, and check how my book is doing. I log on to the seller portal to see how many copies have been sold. I go to the Amazon page and see what are the other books that people who have bought my book are buying (on the US store it’s Ray Dalio’s Principles, as of now. On the UK and India stores, Sidin’s Bombay Fever is the beer to my book’s diapers).

And then I check if there are new reviews of my book. When friends write them, they notify me, so it’s easy to track. What I discover when I visit my Amazon page are the reviews written by people I don’t know. And so far, most of them have been good.

So today was one of those narcissistic days, and I was initially a bit disappointed to see a new four-star review. I started wondering what this person found wrong with my book. And then I read through the review and found it to be wholly positive.

A quick conversation with the wife followed, and she pointed out that this reviewer perhaps reserves five stars for the exceptional. And then my mind went back to this topic that I’d blogged about way back in 2015 – about rating systems.

The “4.8” score that Amazon gives as an average of all the ratings on my book so far is a rather crude measure – since one reviewer’s 4* rating might differ significantly from another reviewer’s.

For example, my “default rating” for a book might be 5/5, with 4/5 reserved for books I don’t like and 3/5 for atrocious books. On the other hand, you might use the “full scale” and use 3/5 as your average rating, giving 4 for books you really like and very rarely giving a 5.

By simply taking an arithmetic average of ratings, it is possible to overstate the quality of a product that has for whatever reason been rated mostly by people with high default ratings (such a correlation is plausible). Similarly a low average rating for a product might mask the fact that it was rated by people who inherently give low ratings.

As I argue in the penultimate chapter of my book (or maybe the chapter before that – it’s been a while since I finished it), one way that platforms foster transactions is by increasing information flow between the buyer and the seller (this is one thing I’ve gotten good at – plugging my book’s name in random sentences), and one way to do this is by sharing reviews and ratings.

From this perspective, for a platform’s judgment on a product or seller (usually it’s the seller, but for products such as AirBnb, information about buyers also matters) to be credible, it is important that they be aggregated in the right manner.

One way to do this is to use some kind of a Z-score (relative to other ratings that the rater has given) and then come up with a normalised rating. But then this needs to be readjusted for the quality of the other items that this rater has rated. So you can think of some kind of a Singular Value Decomposition you can perform on ratings to find out the “true value” of a product (ok this is an achievement – using a linear algebra reference given how badly I suck in the topic).

I mean – it need not be THAT complicated, but the basic point is that it is important that platforms aggregate ratings in the right manner in order to convey accurate information about counterparties.

So after much deliberation and procrastination, I’ve finally started a newsletter. I call it “the art of data science” and the title should be self-explanatory. It’s pure unbridled opinion (the kind of which usually goes on this blog), except that I only write about one topic there.

I intend to have three sections and then a “chart of the edition” (note how cleverly I’ve named this section to avoid giving much away on the frequency of the newsletter!). This edition, though, I ended up putting too much harikathe, so I restricted to two sections before the chart.

I intend to talk a bit each edition about some philosophical part of dealing with data (this section got a miss this time), a bit on data analysis methods (I went a bit meta on this this time) and a little bit on programming languages (which I used for bitching a bit).

And that I plan to put a “chart of the edition” means I need to read newspapers a lot more, since you are much more likely to find gems (in either direction) there than elsewhere. For the first edition, I picked off a good graph I’d seen on Twitter, and it’s about Hull City!

Anyway, enough of this meta-harikathe. You can read the first edition of the newsletter here. In case you want to get it in your inbox each week/fortnight/whenever I decide to write it, then subscribe here!

And put feedback (by email, not comments here) on what you think of the newsletter!

## High dimension and low dimension data science

I’ve observed that there are two broad approaches that people take to getting information out of data. One approach is to simply throw a kitchen sink full of analytical techniques at the data. Without really trying to understand what the data looks like, and what the relationships may be, the analyst simply uses one method after another to try and get insight from the data. Along the way, a “model” will get built.

The other approach (which I’m partial to) involves understanding each variable, and relationship between variables as a first step to getting insight from the data. Here, too, a model might get built, but it will be conditional on the analyst’s view on what kind of a model might suit the data after looking at the data.

Considering that both these approaches are used by large numbers of analysts, it is highly likely that both are legitimate. Then what explains the fact that some analysts use one approach, and others use another? Having thought about it for a long time, I have a hypothesis – it depends on the kind of data being analysed. More precisely, it has to do with the dimensionality of the data.

The first approach (which one might classify as “machine learning”) works well when the data is of high dimensions – where the number of predictors that can be used for predictors is really large, of the order of thousands or larger. For example, even a seemingly low-resolution 32 by 32 pixel image, looked at as a data point, has 1024 dimensions (colour of the image at each of the 1024 pixels is a different dimension).

Moreover, in such situations, it is likely that the signal in the data doesn’t come from one, or, two, or a handful of predictors. In high dimension data science, the signal usually comes from complex interplay of data along various dimensions. And this kind of search is not something humans are fit for – it is best that the machines are left to “learn” the model by themselves, and so you get machine learning.

On the other hand, when the dimensionality of the dataset is low, it is possible (and “easy”) for an analyst to look at the interplay of factors in detail, and understand the data before going on to build the model. Doing so can help the analyst identify patterns in the data that may not be that apparent to a machine, and it is also likely that in such datasets, the signal  lies with data along a small number of dimensions, where relatively simple manipulation will suffice. The low dimensionality also means that complex machine learning techniques are unlikely to contribute much in such cases.

As you might expect, from an organisational perspective, the solution is quite simple – to deploy high-dimension data scientists on high-dimension problems, and likewise with low-dimension data scientists. Since this distinction between high-dimension and low-dimension data scientists isn’t very well known, it’s quite possible that the scientists might be made to work on a problem of dimensionality that is outside of their comfort zone.

So when you have low dimensional data scientists faced with a large number of dimensions of data, you will see them use brute force to try and find signals in bivariate relationships in the data – an approach that will never work since the signal lies in a more complex interplay of dimensions.

On the other hand, when you put high dimension data scientists on a low dimension problem, you will either see them missing out on associations that a human could easily find but a machine might find hard to find, or they might unnecessarily “reduce the problem to a known problem” by generating and importing large amounts of data in order to turn it into a high dimension problem!

PS: I wanted to tweet this today but forgot. Basically, you use logistic regression when you think the signal is an “or” of conditions on some of the underlying variables. On the other hand, if you think the signal is more likely to be an “and” condition of certain conditions, then you should use decision trees!

## When a two-by-two ruins a scatterplot

The BBC has some very good analysis of the Brexit vote (how long back was that?), using voting data at the local authority level, and correlating it with factors such as ethnicity and educational attainment.

In terms of educational attainment, there is a really nice chart, that shows the proportion of voters who voted to leave against the proportion of population in the ward with at least a bachelor’s degree. One look at the graph tells you that the correlation is rather strong:

‘Source: http://www.bbc.com/news/uk-politics-38762034And then there is the two-by-two that is superimposed on this – with regions being marked off in pink and grey. The idea of the two-by-two must have been to illustrate the correlation – to show that education is negatively correlated with the “leave” vote.

But what do we see here? A majority of the points lie in the bottom left pink region, suggesting that wards with lower proportion of graduates were less likely to leave. And this is entirely the wrong message for the graph to send.

The two-by-two would have been useful had the points in the graph been neatly divided into clusters that could be arranged in a grid. Here, though, what the scatter plot shows is a nice negatively correlated linear relationship. And by putting those pink and grey boxes, the illustration is taking attention away from that relationship.

Instead, I’d simply put the scatter plot as it is, and maybe add the line of best fit, to emphasise the negative correlation. If I want to be extra geeky, I might also write down the $R^2$ next to the line, to show the extent of correlation!

## Medium stats

So Medium sends me this email:

Congratulations! You are among the top 10% of readers and writers on Medium this year. As a small thank you, we’ve put together some highlights from your 2016.

Now, I hardly use Medium. I’ve maybe written one post there (a long time ago) and read only a little bit (blogs I really like I’ve put on RSS and read on Feedly). So when Medium tells me that I, who considers myself a light user, is “in the top 10%”, they’re really giving away the fact that the quality of usage on their site is pretty bad.

Sometimes it’s bloody easy to see through flattery! People need to be more careful on what the stats they’re putting out really convey!

## Quantifying life

During a casual conversation on Monday, the wife remarked that given my interests and my profession (where I mostly try to derive insights from data), she was really surprised that I had never tried using data to optimise my own life.

This is a problem I’ve had in the past – I can look at clients’ data and advise them on how exactly to build their business, but I’m thoroughly incapable of doing similar analysis of my own business. I berate people for not using data and relying too much on “gut”, but “gut” is what I use for most of my own life decisions.

With this contradiction in mind, it made sense for me to start quantifying my life. Except that I didn’t know where to start. The first thing you think of when you want to do something new is to buy new gadgets for it, and I quickly asked the wife to pick up a Fitbit for me on her way back from the US next month. She would have none of it – I should use the tools that I have, she said.

I’ve tried logging stuff and writing diaries in the past but it’s mostly been tedious business (unless I’ve had to write my diary free form, which I’ve quite liked). A couple of days is all that most logs have lasted before I’ve lost interest. I hate making checklists (looking at them psyches me out), I maintain my calendar in my head (thus wasting precious memory space) and had nightmares writing notes in school.

A couple of times when I’ve visited dieticians or running coaches I’ve been asked to make a log of what I’ve been eating, and I’ve never been able to do it for more than one meal – there is too much ambiguity in the data (a “cup of dal” can mean several things) to be entered which makes the data entry process tedious.

This time, however, I’m quite bullish about maintaining the log that the wife has created for me. Helpfully, it’s on Google Docs, so I can access it on the move. More importantly, she has structured the sheet in a way that there is no fatigue in entry. The number of columns is more than what I would have liked, but having used it for two days so far, I don’t see why I should be tired of this.

The key is the simplicity of questions, and amount of effort required to fill them in. Most questions are straightforward (“what time did you wake up?” “what time did you have breakfast” etc.) and have deterministic answers. There are subjective questions (“quality of pre-lunch work”) but the wife has designed them such that I only need to enter a rating (she had put in a 3-point Likert scale which I changed to a 5-point Likert scale since I found the latter more useful here).

There are no essays. No comments. Very little ambiguity on how I should fill. And minimal judgment required.

I might be jumping to conclusions already (it’s been but two days since I started filling), but the design of this questionnaire holds important lessons in how to design a survey or questionnaire in order to get credible.
1. Keep things simple
2. Reduce subjectivity as much as possible
3. Don’t tax the filler’s mind too much. The less the mental effort required the better.
4. Account for NED. Don’t make the questionnaire too long else it causes fatigue. My instructions to the wife was that the questionnaire should be small enough to fit in my browser window (when viewed on computer). This would have limited the questions to 11 but she’s put 14, which is still not too bad.

The current plan is to collect data over the next 45 days after which we will analyse it. I may or may not share the results of the analysis here. But I’ll surely recommend my wife’s skills in designing questionnaires! Maybe she should take a hint from this in terms of her post-MBA career.

## Restaurants, deliveries and data

Delivery aggregators are moving customer data away from the retailer, who now has less knowledge about his customer.

Ever since data collection and analysis became cheap (with cloud-based on-demand web servers and MapReduce), there have been attempts to collect as much data as possible and use it to do better business. I must admit to being part of this racket, too, as I try to convince potential clients to hire me so that I can tell them what to do with their data and how.

And one of the more popular areas where people have been trying to use data is in getting to “know their customer”. This is not a particularly new exercise – supermarkets, for example, have been offering loyalty cards so that they can correlate purchases across visits and get to know you better (as part of a consulting assignment, I once sat with my clients looking at a few supermarket bills. It was incredible how much we humans could infer about the customers by looking at those bills).

The recent tradition (after it has become possible to analyse large amounts of data) is to capture “loyalties” across several stores or brands, so that affinities can be tracked across them and customer can be understood better. Given data privacy issues, this has typically been done by third party agents, who then sell back the insights to the companies whose data they collect. An early example of this is Payback, which links activities on your ICICI Bank account with other products (telecom providers, retailers, etc.) to gain superior insights on what you are like.

Nowadays, with cookie farming on the web, this is more common, and you have sites that track your web cookies to figure out correlations between your activities, and thus infer your lifestyle, so that better advertisements can be targeted at you.

In the last two or three years, significant investments have been made by restaurants and retailers to install devices to get to know their customers better. Traditional retailers are being fitted with point-of-sale devices (provision of these devices is a highly fragmented market). Restaurants are trying to introduce loyalty schemes (again a highly fragmented market). This is all an attempt to better get to know the customer. Except that middlemen are ruining it.

I’ve written a fair bit on middleman apps such as Grofers or Swiggy. They are basically delivery apps, which pick up goods for you from a store and deliver it to your place. A useful service, though as I suggest in my posts linked above, probably overvalued. As the share of a restaurant or store’s business goes to such intermediaries, though, there is another threat to the restaurant – lack of customer data.

When Grofers buys my groceries from my nearby store, it is unlikely to tell the store who it is buying for. Similarly when Swiggy buys my food from a restaurant. This means loyalty schemes of these sellers will go for a toss. Of course not offering the same loyalty program to delivery companies is a no-brainer. But what the sellers are also missing out on is the customer data that they would have otherwise captured (had they sold directly to the customer).

A good thing about Grofers or Swiggy is that they’ve hit the market at a time when sellers are yet to fully realise the benefits of capturing customer data, so they may be able to capture such data for cheap, and maybe sell it back to their seller clients. Yet, if you are a retailer who is selling to such aggregators and you value your customer data, make sure you get your pound of flesh from these guys.