When a two-by-two ruins a scatterplot

The BBC has some very good analysis of the Brexit vote (how long back was that?), using voting data at the local authority level, and correlating it with factors such as ethnicity and educational attainment.

In terms of educational attainment, there is a really nice chart that shows the proportion of voters who voted to leave against the proportion of the population in the ward with at least a bachelor’s degree. One look at the graph tells you that the correlation is rather strong:

Source: http://www.bbc.com/news/uk-politics-38762034

And then there is the two-by-two that is superimposed on this – with regions being marked off in pink and grey. The idea of the two-by-two must have been to illustrate the correlation – to show that education is negatively correlated with the “leave” vote.

But what do we see here? A majority of the points lie in the bottom left pink region, suggesting that wards with a lower proportion of graduates were less likely to vote leave. And this is entirely the wrong message for the graph to send.

The two-by-two would have been useful had the points in the graph been neatly divided into clusters that could be arranged in a grid. Here, though, what the scatter plot shows is a nice negatively correlated linear relationship. And by putting those pink and grey boxes, the illustration is taking attention away from that relationship.

Instead, I’d simply put the scatter plot as it is, and maybe add the line of best fit, to emphasise the negative correlation. If I want to be extra geeky, I might also write down the R^2 next to the line, to show the extent of correlation!
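As a rough sketch of that suggestion, the snippet below fits a line of best fit and computes R^2 with NumPy. The numbers are made-up stand-ins, not the BBC's data, and the variable names are purely illustrative.

```python
import numpy as np

# Synthetic stand-in data, NOT the BBC's figures: share of graduates (%)
# and leave vote share (%) for a handful of made-up areas.
graduates = np.array([12.0, 15.0, 18.0, 22.0, 25.0, 30.0, 35.0, 42.0, 50.0])
leave = np.array([68.0, 66.0, 61.0, 58.0, 55.0, 49.0, 45.0, 38.0, 30.0])

# Least-squares line of best fit: leave ~ slope * graduates + intercept
slope, intercept = np.polyfit(graduates, leave, 1)

# For a straight-line fit with one predictor, R^2 equals the square of
# the Pearson correlation coefficient.
r = np.corrcoef(graduates, leave)[0, 1]
r_squared = r ** 2

print(f"leave = {slope:.2f} * graduates + {intercept:.2f}, R^2 = {r_squared:.3f}")
```

The fitted line can then be drawn over the scatter plot with whatever plotting library is at hand, with the R^2 value written next to it.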


Medium stats

So Medium sends me this email:

Congratulations! You are among the top 10% of readers and writers on Medium this year. As a small thank you, we’ve put together some highlights from your 2016.

Now, I hardly use Medium. I’ve maybe written one post there (a long time ago) and read only a little bit (blogs I really like I’ve put on RSS and read on Feedly). So when Medium tells me that I, who consider myself a light user, am “in the top 10%”, they’re really giving away the fact that the quality of usage on their site is pretty bad.

Sometimes it’s bloody easy to see through flattery! People need to be more careful about what the stats they’re putting out really convey!


Quantifying life

During a casual conversation on Monday, the wife remarked that given my interests and my profession (where I mostly try to derive insights from data), she was really surprised that I had never tried using data to optimise my own life.

This is a problem I’ve had in the past – I can look at clients’ data and advise them on how exactly to build their business, but I’m thoroughly incapable of doing similar analysis of my own business. I berate people for not using data and relying too much on “gut”, but “gut” is what I use for most of my own life decisions.

With this contradiction in mind, it made sense for me to start quantifying my life. Except that I didn’t know where to start. The first thing you think of when you want to do something new is to buy new gadgets for it, and I quickly asked the wife to pick up a Fitbit for me on her way back from the US next month. She would have none of it – I should use the tools that I have, she said.

I’ve tried logging stuff and writing diaries in the past but it’s mostly been tedious business (unless I’ve had to write my diary free form, which I’ve quite liked). A couple of days is all that most logs have lasted before I’ve lost interest. I hate making checklists (looking at them psyches me out), I maintain my calendar in my head (thus wasting precious memory space) and had nightmares writing notes in school.

A couple of times when I’ve visited dieticians or running coaches I’ve been asked to make a log of what I’ve been eating, and I’ve never been able to do it for more than one meal – there is too much ambiguity in the data (a “cup of dal” can mean several things) to be entered which makes the data entry process tedious.

This time, however, I’m quite bullish about maintaining the log that the wife has created for me. Helpfully, it’s on Google Docs, so I can access it on the move. More importantly, she has structured the sheet in a way that there is no fatigue in entry. The number of columns is more than what I would have liked, but having used it for two days so far, I don’t see why I should be tired of this.

The key is the simplicity of the questions, and the small amount of effort required to fill them in. Most questions are straightforward (“what time did you wake up?” “what time did you have breakfast?” etc.) and have deterministic answers. There are subjective questions (“quality of pre-lunch work”) but the wife has designed them such that I only need to enter a rating (she had put in a 3-point Likert scale which I changed to a 5-point Likert scale since I found the latter more useful here).

There are no essays. No comments. Very little ambiguity on how I should fill. And minimal judgment required.

I might be jumping to conclusions already (it’s been but two days since I started filling), but the design of this questionnaire holds important lessons in how to design a survey or questionnaire in order to get credible data:
1. Keep things simple
2. Reduce subjectivity as much as possible
3. Don’t tax the filler’s mind too much. The less the mental effort required the better.
4. Account for NED. Don’t make the questionnaire too long, else it causes fatigue. My instruction to the wife was that the questionnaire should be small enough to fit in my browser window (when viewed on a computer). This would have limited the questions to 11 but she’s put 14, which is still not too bad.

The current plan is to collect data over the next 45 days after which we will analyse it. I may or may not share the results of the analysis here. But I’ll surely recommend my wife’s skills in designing questionnaires! Maybe she should take a hint from this in terms of her post-MBA career.

Restaurants, deliveries and data

Delivery aggregators are moving customer data away from the retailer, who now has less knowledge about his customer. 

Ever since data collection and analysis became cheap (with cloud-based on-demand web servers and MapReduce), there have been attempts to collect as much data as possible and use it to do better business. I must admit to being part of this racket, too, as I try to convince potential clients to hire me so that I can tell them what to do with their data and how.

And one of the more popular areas where people have been trying to use data is in getting to “know their customer”. This is not a particularly new exercise – supermarkets, for example, have been offering loyalty cards so that they can correlate purchases across visits and get to know you better (as part of a consulting assignment, I once sat with my clients looking at a few supermarket bills. It was incredible how much we humans could infer about the customers by looking at those bills).

The recent tradition (after it has become possible to analyse large amounts of data) is to capture “loyalties” across several stores or brands, so that affinities can be tracked across them and customers can be understood better. Given data privacy issues, this has typically been done by third party agents, who then sell back the insights to the companies whose data they collect. An early example of this is Payback, which links activities on your ICICI Bank account with other products (telecom providers, retailers, etc.) to gain superior insights on what you are like.

Nowadays, with cookie farming on the web, this is more common, and you have sites that track your web cookies to figure out correlations between your activities, and thus infer your lifestyle, so that better advertisements can be targeted at you.

In the last two or three years, significant investments have been made by restaurants and retailers to install devices to get to know their customers better. Traditional retailers are being fitted with point-of-sale devices (provision of these devices is a highly fragmented market). Restaurants are trying to introduce loyalty schemes (again a highly fragmented market). This is all an attempt to better get to know the customer. Except that middlemen are ruining it.

I’ve written a fair bit on middleman apps such as Grofers or Swiggy. They are basically delivery apps, which pick up goods for you from a store and deliver them to your place. A useful service, though as I suggest in my posts linked above, probably overvalued. As a growing share of a restaurant or store’s business goes to such intermediaries, though, there is another threat to the restaurant – lack of customer data.

When Grofers buys my groceries from my nearby store, it is unlikely to tell the store who it is buying for. Similarly when Swiggy buys my food from a restaurant. This means loyalty schemes of these sellers will go for a toss. Of course not offering the same loyalty program to delivery companies is a no-brainer. But what the sellers are also missing out on is the customer data that they would have otherwise captured (had they sold directly to the customer).

A good thing about Grofers or Swiggy is that they’ve hit the market at a time when sellers are yet to fully realise the benefits of capturing customer data, so they may be able to capture such data for cheap, and maybe sell it back to their seller clients. Yet, if you are a retailer who is selling to such aggregators and you value your customer data, make sure you get your pound of flesh from these guys.

On Uppi2’s top rating

So it appears that my former neighbour Upendra’s new magnum opus Uppi2 is currently the top rated movie on IMDB, with a rating of 9.7/10.0. The Times of India is so surprised that it has done an entire story about it, which I’ve taken a screenshot of here:

[Screenshot of the Times of India story, taken 2015-08-17]

The story also mentions that another Kannada movie RangiTaranga (which I’ve reviewed here) is in third spot, with a rating of 9.4 out of 10. This might lead you to wonder why Kannada movies have suddenly turned out to be so good. The answer, however, lies in simple logic.

The first is that both are relatively new movies and hence their ratings suffer from “small sample bias”. Of course, the sample isn’t that small – Uppi2 has received 1900 votes, which is three times as many as its 1999 prequel Upendra. Yet, it being a new movie, only a subset of the small set of people who have watched it so far would have reviewed it.

The second is selection bias. The people who see a movie in its first week are usually the hardcore fans, and in this case it is hardcore fans of Upendra’s movies. And hardcore fans usually find it hard to have their belief shaken (a version of what I’ve written about online opinions for Mint here), and hence they all give the movie a high rating.

As time goes by, and people who are not as hardcore fans of Upendra start watching and reviewing the movie, the ratings are likely to rationalise. Finally, ratings are easy to rig, especially when samples are small. For example, an Upendra fan club might have decided to play up the movie online by voting en masse on IMDB, and pushing up its ratings. This might explain both why the movie already has 1900 ratings in four days, and why most of them are extremely positive.

The solution for this is for the rating system (IMDB in this case) to give more weight to “verified ratings” (by people who have rated more movies in the past, for instance), or to remove highly correlated ratings. Right now, the rating algorithm seems pretty naive.
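To make the idea of weighting by rater history concrete, here is a minimal sketch. The weighting function (log of one plus the number of movies a person has rated before) and the votes are assumptions chosen for illustration; this is not how IMDB actually computes its ratings.

```python
import math

def weighted_rating(ratings):
    """Average scores, weighting each rater by their rating history.

    `ratings` is a list of (score, past_ratings_count) tuples.
    The weight log(1 + past_ratings_count) is a hypothetical choice:
    first-time raters get zero weight, seasoned raters get more.
    """
    total_weight = 0.0
    weighted_sum = 0.0
    for score, past_count in ratings:
        weight = math.log1p(past_count)
        weighted_sum += weight * score
        total_weight += weight
    if total_weight == 0:
        return None  # nobody with any rating history has voted yet
    return weighted_sum / total_weight

# Made-up votes: many new accounts giving 10, fewer seasoned raters giving 7.
votes = [(10, 0)] * 50 + [(10, 1)] * 30 + [(7, 200)] * 20

naive = sum(score for score, _ in votes) / len(votes)
weighted = weighted_rating(votes)

print(f"naive mean = {naive:.2f}, history-weighted mean = {weighted:.2f}")
```

With these made-up votes the naive mean is 9.4, while the history-weighted mean falls between 7 and 8, because the block of votes from accounts with little or no history counts for much less.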

Coming back to Uppi2, from what I’ve heard from people, the movie is supposed to be really good, though perhaps not 9.7 good. I plan to watch the movie in the next few days and will write a review once I do so.

Meanwhile, read this absolutely brilliant review (in Kannada) written by this guy called “Jogi”

Using all available information

In “real-life” problems, it is not necessary to use all the given data. 

My mind goes back eleven years, to the first exam in the Quantitative Methods course at IIMB. The exam contained a monster probability problem. It was so monstrous that only some two or three out of my batch of 180 could solve it. And it was monstrous because it required you to use every given piece of information (most people missed out the “X and Y are independent” statement, since this bit of information was in words, while everything else was in numbers).

In school, you get used to solving problems where you are required to use all the given information and only the given information to solve the given problem. Taken out of the school setting, however, this is not true any more. Sometimes in “real life”, you have problems where next to no data is available, for which you need to make assumptions (hopefully intelligent) and solve the problem.

And there are times in “real life” when you are flooded with so much data that a large part of the problem solving process is in the identification of what data is actually relevant and what you can ignore. It can often happen that different pieces of given information contradict each other. Deciding what to use and what to ignore is then critical to an efficient solution, and that decision is an art form.

Yet, in the past I’ve observed that people are not happy when you don’t use all the information at your disposal. The general feeling is that ignoring information leads to a suboptimal model – one which could be bettered by including the additional information. There are several reasons, though, that one might choose to leave out information while solving a real-life problem:

  • Some pieces of available information are mutually contradictory, so taking them both into account will lead to no solution.
  • A piece of data may not add any value after taking into account the other data at hand.
  • The incremental impact of a particular piece of information is so marginal that you don’t lose much by ignoring it
  • Making use of all available information can lead to increased complexity in the model, and the incremental impact of the information may not warrant this complexity
  • It might be possible to use an established model if you were to use only part of the information. You then trade some precision for a known model. This is not always recommended, but it is done.
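The second and third points in the list can be illustrated with a small simulation. The sketch below uses synthetic data (all names and numbers are made up) in which a second predictor is almost a copy of the first, and compares the R^2 of a least-squares fit with and without it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

x1 = rng.normal(size=n)
# x2 is nearly a copy of x1, so it carries almost no new information.
x2 = x1 + rng.normal(scale=0.05, size=n)
# The outcome depends on x1 only, plus noise.
y = 3.0 * x1 + rng.normal(scale=1.0, size=n)

def r_squared(features, target):
    """R^2 of an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(target))] + features)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    residuals = target - X @ coef
    return 1.0 - residuals.var() / target.var()

r2_one = r_squared([x1], y)
r2_both = r_squared([x1, x2], y)

print(f"R^2 with x1 only: {r2_one:.4f}")
print(f"R^2 with x1 and x2: {r2_both:.4f}")
```

The second model can never fit worse on the same data, but in a setup like this the gain is tiny, which is the sense in which the extra information does not warrant the extra complexity.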

The important takeaway, though, is that knowing what information to use is an art, and this forms a massive difference between textbook problems and real-life problems.

Recommendations and rating systems

This is something that came out of my IIMB class this morning. We were discussing building recommendation systems, using the whisky database (check related blog posts here and here). One of the techniques of recommendation we were discussing was “market basket analysis”, where you recommend products to people based on combinations of products that other people have been buying.

This is when one of the students popped up with the observation that market basket analysis done without “ratings” can be self-fulfilling! It was an extremely profound observation, so I made a mental note to blog about this. And I’ve told you earlier that this IIMB class that I’m teaching is good!

So the concept is that if a lot of people have been buying A and B together, then you start recommending B to buyers of A. Let us say that there are a number of people who are buying A and C, but not B, but based on our analysis that people buy A and B together, we recommend B to them. Let’s assume that they’ve taken our recommendation and bought B, which means that these people are now seen to have bought both B and C together.

Now, in case we don’t collect their feedback on B, we have no clue that they didn’t like B (let’s assume that for whatever reason buyers of C don’t like B), but in the next iteration, we see that buyers of C have been buying B, and so we start recommending B to other C buyers. And so a bad idea (recommending B to buyers of C, thanks to A) can spiral and put the confidence of our recommendation system in tatters.

Hence, it is useful to collect feedback (in the form of ratings) on items that we recommend to customers, so that these “recommended purchases” don’t end up distorting our larger data set!
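The loop described above can be sketched in a few lines. Everything here is made up for illustration: the purchase history, the 1-5 rating scale and the `MIN_RATING` threshold are assumptions, and counting pairs is only the simplest form of market basket analysis.

```python
from collections import Counter
from itertools import combinations

def pair_counts(baskets):
    """Count how often each pair of products appears in the same basket."""
    counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return counts

# Made-up history: six people bought A and B, four bought A and C.
history = [{"A", "B"} for _ in range(6)] + [{"A", "C"} for _ in range(4)]
before = pair_counts(history)

# B is recommended to the four A-and-C buyers. Assume all of them buy it
# and all of them dislike it (rating 1 on a hypothetical 1-5 scale).
accepted = [{"basket": {"A", "C"}, "recommended": "B", "rating": 1} for _ in range(4)]

# Without feedback, a recommended purchase looks just like an organic one.
naive_baskets = history[:6] + [p["basket"] | {p["recommended"]} for p in accepted]
naive = pair_counts(naive_baskets)

# With feedback, a recommended item only counts if it was rated well.
MIN_RATING = 3  # hypothetical threshold
filtered_baskets = history[:6] + [
    p["basket"] | {p["recommended"]} if p["rating"] >= MIN_RATING else p["basket"]
    for p in accepted
]
filtered = pair_counts(filtered_baskets)

# Co-occurrence of B and C: before, without feedback, with feedback.
print(before[("B", "C")], naive[("B", "C")], filtered[("B", "C")])  # prints: 0 4 0
```

Without ratings, B and C suddenly appear together four times, which is exactly the spurious signal that would lead to B being recommended to other buyers of C. With the rating filter the count stays at zero.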

Of course what I’m saying here is not definitive, and needs more work, but it is an interesting idea nevertheless and worth being communicated. There can be some easy workarounds – like not taking into account recommended products while doing the market basket analysis, or trying to find negative lists and so on.

Nevertheless, I thought this is an interesting concept and hence worth sharing.