When a two-by-two ruins a scatterplot

The BBC has some very good analysis of the Brexit vote (how long back was that?), using voting data at the local authority level, and correlating it with factors such as ethnicity and educational attainment.

In terms of educational attainment, there is a really nice chart that plots the proportion of voters who voted to leave against the proportion of the population in the ward with at least a bachelor’s degree. One look at the graph tells you that the correlation is rather strong:

Source: http://www.bbc.com/news/uk-politics-38762034

And then there is the two-by-two that is superimposed on this – with regions being marked off in pink and grey. The idea of the two-by-two must have been to illustrate the correlation – to show that education is negatively correlated with the “leave” vote.

But what do we see here? A majority of the points lie in the bottom left pink region, suggesting that wards with a lower proportion of graduates were less likely to vote leave. And this is entirely the wrong message for the graph to send.

The two-by-two would have been useful had the points in the graph been neatly divided into clusters that could be arranged in a grid. Here, though, what the scatter plot shows is a nice negatively correlated linear relationship. And by putting those pink and grey boxes, the illustration is taking attention away from that relationship.

Instead, I’d simply put the scatter plot as it is, and maybe add the line of best fit, to emphasise the negative correlation. If I want to be extra geeky, I might also write down the R^2 next to the line, to show the extent of correlation!
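For what it's worth, here is a rough sketch of how one might compute that line of best fit and the R^2, using ordinary least squares in plain Python. The numbers are made up purely for illustration (they are NOT the BBC's ward-level data), and the function name `fit_line` is my own.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations in x, and co-deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    r_squared = 1 - ss_res / ss_tot
    return slope, intercept, r_squared


# Hypothetical wards: % of population with a degree vs % voting leave.
# Illustrative values only, not taken from the BBC analysis.
graduates_pct = [15, 20, 25, 30, 35, 40, 45, 50]
leave_pct = [68, 65, 59, 56, 50, 47, 41, 38]

slope, intercept, r2 = fit_line(graduates_pct, leave_pct)
print(f"leave% = {intercept:.1f} + ({slope:.2f}) * graduates%, R^2 = {r2:.3f}")
```

A negative slope with an R^2 close to 1 is exactly the story the scatter plot tells on its own, without any pink and grey boxes.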


Financial inclusion and cash

Varad Pande and Nirat Bhatnagar have an interesting Op-Ed today in Mint about financial inclusion, and about how financial institutions haven’t been innovative to make products that are suited to the poor, and how better user interface can also drive financial inclusion. I found this example they took rather interesting:

Take, for instance, a daily wager who makes Rs200 on the days she gets work. Work is unpredictable, and expenses too can be volatile, so she has to borrow money for buying vegetables, or to pay the doctor’s fees when her children fall sick. Her real need is for a flexible—small ticket, variable amount, rapid approval—loan product that she can access instantly. Unfortunately, no institutional channel—neither the public sector bank where she has a “no frills” account, nor the MFI that she has previously borrowed from—offers such a product. She ends up borrowing from neighbours, often from the local moneylender.

Now, based on my experience in FinTech, it is not hard to design a loan product for someone whose cash flows are known. The bank statement is nothing but a continuing story of the account holder’s life, and if you can understand the cash flows (both in and out) for a reasonable period of time, it is straightforward to design a loan product that fits that cash flow pattern.
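To make this a little more concrete, here is a toy sketch of one way a lender might size a small flexible credit line from a known cash flow history: track the running balance and see how far it dips below zero. Everything here is an assumption for illustration (the function name `max_shortfall`, the daily figures, and the sizing rule itself); a real underwriting model would look at much more than this.

```python
def max_shortfall(daily_net_flows, opening_balance=0):
    """Largest amount by which the running balance falls below zero.

    daily_net_flows: net cash per day (inflows positive, outflows negative).
    Returns 0 if the balance never goes negative.
    """
    balance = opening_balance
    worst = 0
    for flow in daily_net_flows:
        balance += flow
        worst = min(worst, balance)
    return -worst


# Hypothetical daily net cash flows (in rupees) for a daily wage earner:
# Rs 200 on days with work, irregular expenses on other days.
flows = [200, -150, -150, 200, -300, 200, 200, -100]

credit_line_needed = max_shortfall(flows)
print(f"Credit line that would have covered every dip: Rs {credit_line_needed}")
```

The point is simply that with the full series of inflows and outflows in hand, a question like "how big a line does she need, and for how long?" becomes arithmetic rather than guesswork.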

The key thing, however, is that you need to have full information on transactions, in terms of when cash comes in and goes out, what the cash outflow is used for, and all that. And that is where the cash economy is a bit of a bummer.

For a banker who is trying to underwrite, and decide the kind of loan product (and interest rate) to offer to a customer, the customer’s cash transactions obscure information; information that could’ve been used by the bank to design/structure/recommend the appropriate product for the customer.

For the case that Pande and Bhatnagar take, if all inflows and outflows are in cash, there is little beyond the potential borrower’s word that can convince bankers of the borrower’s creditworthiness. And so the potential borrower is excluded from the system.

If, on the other hand, the potential borrower were to have used non-cash means for all her transactions, bankers would have had a full picture of her life, and would have been able to give her an appropriate loan!

In this sense, I think so far financial inclusion has been going on ass-backwards, with most microfinance institutions (MFIs) targeting loans rather than deposits. And with little data to base credit on, it’s resulted in wide credit spreads and interest rates that might be seen as usurious.

Instead, if banks and MFIs had gone the other way, first getting customers to deposit, and then to use the bank account for as many of their transactions as possible, it would have been possible to design much better financial products, and include more customers!

The current disruption in the cash economy possibly offers banks and MFIs a good chance to rectify their errors so far!

Intermediation and the battle for data

The Financial Times reports ($) that thanks to the rise of Alipay and WeChat’s payment system, China’s banks are losing significantly in terms of access to customer data. This is on top of the $20 billion or so they’re losing directly in terms of fees because of these intermediaries.

But when a consumer uses Alipay or WeChat for payment, banks do not receive data on the merchant’s name and location. Instead, the bank record simply shows the recipient as Alipay or WeChat.

The loss of data poses a challenge to Chinese banks at a time when their traditional lending business is under pressure from interest-rate deregulation, rising defaults, and the need to curb loan growth following the credit binge. Big data are seen as vital to lenders’ ability to expand into new business lines.

I had written earlier on my blog about how intermediaries such as Swiggy or Grofers, by offering a layer between the restaurant/shop and consumer, now have access to the consumer’s data which earlier resided with the retailer.

What is interesting is that before businesses realised the value of customer data, they had plenty of access to such data and were doing little to leverage and capitalise on it. And now that people are realising the value of data, new intermediaries that are coming in are capturing the data instead.

From this perspective, the Unified Payments Interface (UPI) that launched last week is a key step for Indian banks to hold on to customer data which they could have otherwise lost to payment wallet companies.

Already, some online payments are listed on my credit card statement in the name of the payment gateway rather than in the name of the merchant, denying the credit card issuers data on the customer’s spending patterns. If the UPI can truly take off as a successor to credit cards (rather than wallets), banks can continue to harness customer data.

Mike Hesson and cricket statistics

While a lot is made of the use of statistics in cricket, my broad view based on presentation of statistics in the media and the odd player/coach interview is that cricket hasn’t really learnt how to use statistics as it should. A lot of so-called insights are based on small samples, and coaches such as Peter Moores have been pilloried for their excessive focus on data.

In this context, I found this interview with New Zealand coach Mike Hesson in ESPNCricinfo rather interesting. From my reading of the interview, he seems to “get” data and how to use it, which helps explain why the New Zealand cricket team has generally outperformed expectations in the last few years.

Some snippets:

You’re trying to look at trends rather than chuck a whole heap of numbers at players.

For example, if you look at someone like Shikhar Dhawan, against offspin, he’s struggled. But you’ve only really got a nine or ten-ball sample – so you’ve got to make a decision on whether it’s too small to be a pattern

Also, players take a little while to develop. You’re trying to select the player for what they are now, rather than what their stats suggest over a two or three-year period.

And there are times when you have to revise your score downwards. In our first World T20 match, in Nagpur, we knew it would slow up,


Go ahead and read the whole thing.

Restaurants, deliveries and data

Delivery aggregators are moving customer data away from the retailer, who now has less knowledge about his customer. 

Ever since data collection and analysis became cheap (with cloud-based on-demand web servers and MapReduce), there have been attempts to collect as much data as possible and use it to do better business. I must admit to being part of this racket, too, as I try to convince potential clients to hire me so that I can tell them what to do with their data and how.

And one of the more popular areas where people have been trying to use data is in getting to “know their customer”. This is not a particularly new exercise – supermarkets, for example, have been offering loyalty cards so that they can correlate purchases across visits and get to know you better (as part of a consulting assignment, I once sat with my clients looking at a few supermarket bills. It was incredible how much we humans could infer about the customers by looking at those bills).

The recent trend (now that it has become possible to analyse large amounts of data) is to capture “loyalties” across several stores or brands, so that affinities can be tracked across them and customers can be understood better. Given data privacy issues, this has typically been done by third party agents, who then sell back the insights to the companies whose data they collect. An early example of this is Payback, which links activities on your ICICI Bank account with other products (telecom providers, retailers, etc.) to gain superior insights on what you are like.

Nowadays, with cookie farming on the web, this is more common, and you have sites that track your web cookies to figure out correlations between your activities, and thus infer your lifestyle, so that better advertisements can be targeted at you.

In the last two or three years, significant investments have been made by restaurants and retailers to install devices to get to know their customers better. Traditional retailers are being fitted with point-of-sale devices (provision of these devices is a highly fragmented market). Restaurants are trying to introduce loyalty schemes (again a highly fragmented market). This is all an attempt to better get to know the customer. Except that middlemen are ruining it.

I’ve written a fair bit on middleman apps such as Grofers or Swiggy. They are basically delivery apps, which pick up goods for you from a store and deliver them to your place. A useful service, though as I suggest in my posts linked above, probably overvalued. As a growing share of a restaurant or store’s business goes to such intermediaries, though, there is another threat to the restaurant – lack of customer data.

When Grofers buys my groceries from my nearby store, it is unlikely to tell the store who it is buying for. Similarly when Swiggy buys my food from a restaurant. This means loyalty schemes of these sellers will go for a toss. Of course not offering the same loyalty program to delivery companies is a no-brainer. But what the sellers are also missing out on is the customer data that they would have otherwise captured (had they sold directly to the customer).

A good thing about Grofers or Swiggy is that they’ve hit the market at a time when sellers are yet to fully realise the benefits of capturing customer data, so they may be able to capture such data for cheap, and maybe sell it back to their seller clients. Yet, if you are a retailer who is selling to such aggregators and you value your customer data, make sure you get your pound of flesh from these guys.

On apps tracking you and turning you into “lab rats”

Tech2, a division of FirstPost, reports that “Facebook could be tracking all rainbow profile pictures”. In what I think is a nonsensical first paragraph, the report says:

Facebook’s News Feed experiment received a huge blow from its social media networkers. With the new rainbow coloured profile picture that celebrates equality of marriage turned us into ‘lab rats’ again? Facebook is probably tracking all those who are using its new tool to change the profile picture, believes The Atlantic.

I’m surprised things like this still make news. It is a feature (not a bug) of any good organisation that it learns from its user interactions and user behaviour, and hence tracking how users respond to certain kinds of news or updates is a fundamental part of how Facebook should behave.

And Facebook is a company that constantly improves and updates the algorithm it uses in order to decide what updates to show whom. And to do that, it needs to maintain data on who liked what, commented on what, and turned off what kind of updates. Collecting and maintaining and analysing such data is a fundamental, and critical, part of Facebook’s operations, and expecting them not to do so is downright silly (and it would be a downright silly act on part of the management if they stop experimenting or collecting data).

Whenever you sign on to an app or a service, you need to take it as a given that the app is collecting data and information from you. And that if you are not comfortable with this kind of data capture, you are better off not using the app. Of course, network effects mean that it is not that easy to live like you did in “the world until yesterday”.

This seems like yet another case of Radically Networked Outrage by outragers not having enough things to outrage about.

Getting counted

So I got counted yesterday. And my caste was also noted. This was part of the caste census that is being currently conducted by the state government. I had a few pertinent observations on the questions and procedure, so thought I should write them down here.

  • The survey team consisted of a man and a woman. I wonder if the gender combination of the surveyors was chosen deliberately, in order to avoid awkwardness in either direction depending upon who answers the door.
  • Anyway, the man seemed to be the senior person and didn’t speak much. The woman had an extraordinarily large “exam pad” (of A2 size if I’m not wrong), with a sheaf of papers where she would note down the answers.
  • So after “normal details” such as names and address, the survey proceeded directly to the caste. “Do I have to answer that?”, I asked. They said I didn’t have a choice. I told them and the lady noted it down. Then I was asked about my subcaste. I again asked if I should answer. The woman said yes, but the man overruled and they moved on.
  • There were other demographic questions involved, which I don’t remember answering during the “general census” four years ago. Stuff like the age at which we got married and the age at which we joined school.

    Anyone who can get their hands on the raw data can have a field day looking at the correlations. Like – what is the distribution of age of marriage by caste? etc.

  • The questionnaire was a pretty long one. A cursory search indicates there are a total of 55 questions. I was also asked about household assets. “How many TVs?” “One” “How many computers?” “Hmm.. Five”. “Laptops?” “I already counted them among computers” and so on.
  • I got asked about my family income also. Now you can see that all this data along with caste can be used to form some interesting correlations.
  • And it doesn’t stop there. This is the first time in a census that I’ve been asked for an identity proof. “We want both voter ID card and Aadhaar”, the lady said. I showed our voter IDs, and she noted down the number. Remember that this is a “caste census”? You know where this might be going now. I told them I couldn’t find my Aadhaar, and they said it was okay.
  • As the lady ploughed through the multiple pages of the extra-large form, I couldn’t help wondering why the surveyors couldn’t have been given tablets instead, which would have saved the repeated effort of first filling up the forms and then entering the data into the database. Given the size of the form and the difficulty of carrying it around, it would have done the surveyors a huge favour.
  • Finally, the form was filled with pencil. Ostensibly this was so that if they made any errors in entry, they could correct immediately. I’m assuming there are no “convenient alterations” made.

Thinking about it now (I didn’t think yesterday) I’ve perhaps given away more information than I should have (voter ID number, etc.), and might have compromised on my privacy. I hope, however, that I’m one of those people who get access to the raw database once it’s compiled (obviously much much easier said than done), given the kind of data that has been collected and the insights that can be drawn from it.

If you are a politician who gets hold of this data, and wants to use it to build your election strategy, you should hire me. There is a wealth of information in this data!