## Statistical analysis revisited – machine learning edition

Over ten years ago, I wrote a blog post that I had termed a “lazy post” – it was an email that I’d written to a mailing list, which I’d then copied onto the blog. It was triggered by someone on the group making an off-hand comment about “doing regression analysis”, and I had set off on a rant about why the misuse of statistics was a massive problem.

Ten years on, I find the post to be quite relevant, except that instead of “statistics”, you just need to say “machine learning” or “data science”. So this is a truly lazy post, where I piggyback on my old post, to talk about the problems with indiscriminate use of data and models.

there is this popular view that if there is data, then one ought to do statistical analysis, and draw conclusions from that, and make decisions based on these conclusions. unfortunately, in a large number of cases, the analysis ends up being done by someone who is not very proficient with statistics and who is basically applying formulae rather than using a concept. as long as you are using statistics as concepts, and not as formulae, I think you are fine. but you get into the “ok i see a time series here. let me put regression. never mind the significance levels or stationarity or any other such blah blah but i’ll take decisions based on my regression” then you are likely to get into trouble.

The modern version of this is – everybody wants to do “big data” and “data science”. So if there is some data out there, people will want to draw insights from it. And since it is easy to apply machine learning models (thanks to open source toolkits such as the scikit-learn package in Python), people who don’t understand the models indiscriminately apply them to whatever data they have got. So you have people who don’t really understand either data or machine learning working with both, and creating models that are dangerous.

As long as people have an idea of the models they are using, the assumptions behind them, and the quality of the data that goes into them, we are fine. However, we are increasingly seeing cases of people taking improper or biased data and applying models they don’t understand on top of it, with results that affect the wider world.

So the problem is not with “artificial intelligence” or “machine learning” or “big data” or “data science” or “statistics”. It is with the people who use them incorrectly.
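To make the “I see a time series, let me put regression” trap concrete, here is a minimal sketch. It assumes nothing beyond NumPy; the function names, the sample sizes and the |t| > 2 cutoff are my own choices for illustration. It regresses pairs of series that are independent by construction, and counts how often the slope looks “significant”, once for stationary white noise and once for non-stationary random walks.

```python
import numpy as np


def slope_t_stat(x, y):
    """t-statistic of the slope in a simple OLS regression of y on x."""
    n = len(x)
    r = np.corrcoef(x, y)[0, 1]
    return r * np.sqrt((n - 2) / (1.0 - r ** 2))


def white_noise(rng, n):
    # Stationary series: independent draws around zero.
    return rng.standard_normal(n)


def random_walk(rng, n):
    # Non-stationary series: running sum of independent shocks.
    return np.cumsum(rng.standard_normal(n))


def share_significant(make_series, trials=300, n=200, seed=0):
    """Fraction of trials where two independent series look related (|t| > 2)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        x = make_series(rng, n)
        y = make_series(rng, n)
        if abs(slope_t_stat(x, y)) > 2:
            hits += 1
    return hits / trials


noise_rate = share_significant(white_noise)
walk_rate = share_significant(random_walk)
print(f"white noise: {noise_rate:.0%} look significant")
print(f"random walks: {walk_rate:.0%} look significant")
```

The two series in every pair have nothing to do with each other, so any “significant” slope is spurious. The white-noise rate stays near the nominal false-positive level, while the random-walk rate is far higher, which is exactly what ignoring stationarity costs you.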

## Programming back to the 1970s

I learnt to write computer code circa 1998, at a time when resources were plenty. I had a computer of my own – an assembled desktop with a 386 processor and RAM that was measured in MBs. It wasn’t particularly powerful, but it was more than adequate to handle the programs I was trying to write.

I wasn’t trying to process large amounts of data. Even when the algorithms were complex, they weren’t that complex. Most code ran in a matter of minutes, which meant that I didn’t need to bother about getting the code right the first time round – apart from for examination purposes. I could iterate and slowly get things right.

This was markedly different from how people programmed back in the 1970s, when computing resources were scarce and people mostly had to write code on paper. Time had to be booked at computer terminals, where the code would be copied onto the computers and then run. The amount of time it took for the code to run meant that you had to get it right the first time round. Any mistake meant standing in line at the terminal again, and further time to run the code.

The problem was particularly dire in the USSR, where the planned economy meant that the shortages of computer resources were even more acute. This has been cited as a reason why Russian programmers who migrated to the US were prized – they had practice in writing code that worked the first time.

Anyway, the point of this post is that coding became progressively easier through the second half of the 20th century, when Moore’s Law was in operation, and computers became faster, smaller and significantly more abundant.

This process continues – computers continue to become better and more abundant – smartphones are nothing but computers. On the other side, however, as storage has gotten cheap and data capture has gotten easier, data sources are significantly larger now than they were a decade or two back.

So if you are trying to write code that uses a large amount of data, it means that each run can take a significant amount of time. When the data size reaches big data proportions (when it all can’t be processed on a single computer), the problem is more complex.

And in that sense, every time you want to run a piece of code, however simple it is, execution takes a long time. This has made bugs much more expensive again – the amount of time programs take to run means that you lose a lot of time in debugging and rewriting your code.

It’s like being in the 1970s all over again!

## Intermediation and the battle for data

The Financial Times reports (\$) that thanks to the rise of Alipay and WeChat’s payment system, China’s banks are losing significant access to customer data. This is on top of the \$20 billion or so they’re losing directly in fees because of these intermediaries.

But when a consumer uses Alipay or WeChat for payment, banks do not receive data on the merchant’s name and location. Instead, the bank record simply shows the recipient as Alipay or WeChat.

The loss of data poses a challenge to Chinese banks at a time when their traditional lending business is under pressure from interest-rate deregulation, rising defaults, and the need to curb loan growth following the credit binge. Big data are seen as vital to lenders’ ability to expand into new business lines.

I had written earlier on my blog about how intermediaries such as Swiggy or Grofers, by offering a layer between the restaurant/shop and the consumer, now have access to consumer data which earlier resided with the retailer.

What is interesting is that before businesses realised the value of customer data, they had plenty of access to such data and were doing little to leverage and capitalise on it. And now that people are realising the value of data, new intermediaries that are coming in are capturing the data instead.

From this perspective, the Unified Payments Interface (UPI) that launched last week is a key step for Indian banks to hold on to customer data which they could otherwise have lost to payment wallet companies.

Already, some online payments are listed on my credit card statement in the name of the payment gateway rather than in the name of the merchant, denying the credit card issuers data on the customer’s spending patterns. If the UPI can truly take off as a successor to credit cards (rather than wallets), banks can continue to harness customer data.

## Big data at HDFC Bank?

I had a bit of a creepy moment today – I must admit that, despite being a “data guy” and recommending clients to use data to make superior decisions (including customisations), it does appear creepy when you as a customer figure that your service provider has used data to customise your experience.

I’m in Barcelona, and wanted to withdraw cash from my Citibank account in India. Withdrew once, but when I wanted to withdraw more, the transaction didn’t go through (this happened multiple times, at multiple ATMs).

Frustrated, I figured that this might be due to some limits (on how much I could transact per day), and then decided to get around the limitations by transferring some money to my HDFC Bank account (since I’m carrying that debit card as well).

An hour after I’d transferred the money by IMPS, I put my HDFC Debit Card in my wallet and was walking out when I saw an email from the bank informing me that my Debit Card was valid only in India, with a link through which I could activate international transactions on it.

I’d never received such emails from HDFC Bank before, so this was surely in the “creepy” category. It might have been sent to me by the bank at “random”, but the odds of that are extremely low. So how did the bank anticipate that I might want to use my debit card here, and send me this email?

I have one possible explanation, and if this is indeed the case, I would be very very impressed with HDFC Bank. Apart from my debit card, I also have a credit card from HDFC Bank, which I’ve been using fairly regularly during my time in Europe (that my only other credit card is an AmEx, which is hardly accepted in Europe, makes this inevitable).

My last transaction on this credit card was to pay for lunch today, and so if HDFC Bank is tracking my transactions there, it knows that I’m currently in Europe (given the large number of EUR transactions recently, if not anything else).

Maybe the bank figured out that if I’m abroad, and have transferred money by IMPS (which implies urgency) into my account, then it is for the purpose of using my debit card here? And hence they sent me the email?

The counterargument to this is that this is not the first time I’ve IMPSd to my HDFC Bank account during this trip – the Income Tax and Service Tax websites don’t accept Citibank, so I routinely transfer to HDFC to make my tax payments. So my argument is not watertight.

Yet, if the above explanation as to why HDFC Bank guessed I was going to use my debit card is true, then there are several things that HDFC Bank has got right:

1. Linkage between my bank account and credit card. While I’ve associated both with the same customer ID, my experience with legacy systems in Indian financial institutions tells me that actually linking them is really impressive.
2. Tracking of my transactions on my credit card to know my whereabouts. If HDFC has done a diligent job of this, they know where exactly I’ve been over the last few months (provided I’ve used my card in these destinations of course).
3. Understanding why I use my account. While I’ve IMPSd several times in the past (as explained above), it’s all been in either the “service tax season” or “advance/self-assessment income tax season”. Mid-May is neither. So maybe HDFC Bank is guessing that this time it may not be for tax reasons?
4. Recognising I might want to use my debit card. If I’ve put money into my account and it’s not tax season, maybe they recognised I might want to use my debit card?
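If the bank is indeed doing something like this, the four steps above boil down to a simple rule. The sketch below is purely my guess at what such a rule could look like: the `CustomerSnapshot` fields, the 50% threshold and the set of “tax season” months are all hypothetical, and nothing here is known to be how HDFC Bank’s systems actually work.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# Hypothetical months in which an inbound transfer is probably for a tax payment.
TAX_SEASON_MONTHS = {3, 6, 9, 12}


@dataclass
class CustomerSnapshot:
    # Hypothetical fields, linked across products via a common customer ID (step 1).
    home_currency: str
    recent_card_currencies: List[str]  # currencies of recent credit card spends
    received_imps_today: bool
    debit_card_international_enabled: bool


def likely_abroad(c: CustomerSnapshot) -> bool:
    """Step 2: infer whereabouts from the currencies of recent card spends."""
    if not c.recent_card_currencies:
        return False
    foreign = sum(1 for cur in c.recent_card_currencies if cur != c.home_currency)
    return foreign / len(c.recent_card_currencies) > 0.5


def should_send_activation_email(c: CustomerSnapshot, today: date) -> bool:
    """Steps 3 and 4: an urgent top-up outside tax season suggests debit card use."""
    return (
        likely_abroad(c)
        and c.received_imps_today
        and today.month not in TAX_SEASON_MONTHS
        and not c.debit_card_international_enabled
    )


# A customer resembling the one in this post; the year is arbitrary.
me = CustomerSnapshot("INR", ["EUR", "EUR", "EUR", "INR"], True, False)
decision = should_send_activation_email(me, date(2000, 5, 15))
print(decision)
```

None of the individual checks is sophisticated. The hard part, as the list above suggests, is having the account and card data linked well enough for the rule to be evaluated at all.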

I might be giving them too much credit, and it may just be that a randomly sent email arrived right after I’d put the money into the account.

And the link they sent to enable international transactions worked! I had to use my laptop (it didn’t work on either the app or mobile web, so that’s one point deducted for them), but with a few clicks after logging into my bank account, I was able to enable the transactions!

So maybe there is reason to be impressed!

## Categorisation and tagging

Tagging offers an efficient method both for searching and for identifying customer preferences on the axis most appropriate for the customer.

The traditional way to organise a retail catalogue is by means of hierarchical categorisation. If you’re selling clothes, for example, you first divide them into men’s and women’s, then into formal and casual, and then into different items of clothing and so on. With a good categorisation, each SKU will have a unique “path” down the category tree. For traditional management purposes, this kind of categorisation might be useful, but it doesn’t lend itself well to either searching or pattern recognition.

To take a personal example (note that I’m going into anecdata territory here), I’m in the market for a hooded sweatshirt, and it has been extremely hard to find. Having given up on a number of “traditional retail” stores in the “High Street” (11th Main Road, 4th Block, Jayanagar, Bangalore) close to where I stay, I decided to check online sources and they’ve left me disappointed, too.

To be more precise, I’m looking for a grey sweatshirt made with a mix of cotton and wool (“traditional sweatshirt material”), with a zipper down the front, a hood, and pockets large enough to keep my hands in. Of size 42. This description is as specific as it gets and I don’t imagine any brand having more than a small number of SKUs that fit this specification.

If I were shopping offline in a well-stocked store (perhaps a “well stocked offline store” is entering mythical territory nowadays), I would repeat the above paragraph to a store attendant (good store attendants are also very hard to find nowadays) and he/she would pick out the sweatshirts that conform to these specifications and I would buy one of them. The question is how one can replicate this experience in online shopping.

In other words, how can we set up our online catalog such that it becomes easy for shoppers to search specifically for what they’re looking for? Currently, most online stores follow a “categorisation” format, where you step down two or three levels of categories, after which you’re shown a large assortment. This, however, doesn’t allow for efficient search. Let me illustrate with my own experience this morning.

1. Amazon.in : I hit “hoodies” in the search bar, and got shown a large assortment of hoodies. I can drill deeper in terms of sleeve length, material, colour and brand. My choice of material (which I’m particular about) is not there in the given list. There are too many colour choices and I can’t simply say “grey” and be shown all greys. There is no option to say I want a zip-open front, or a cotton-wool mix. My search ends there.

2. Jabong (rumoured to be bought by Amazon shortly): I hover over “Men’s”, click on “winter wear” and then on “hoodies”. There is a large assortment of both material (cotton-wool mix not here) and brand. There are several colours available, but no way for me to tell the system I’m looking for a zip-down hoodie. I can set my price-range and size, though. Search ends at a point when there’s too much choice.

3. Flipkart: Hover over “men’s”, click “winter wear” and then sweatshirt. Price, size and brand are the only axes on which I can drill down further. The least impressive of all the sites I’ve seen. Too much choice again at a point when I end search.

4. Myntra (recently bought by Flipkart, but not yet merged): The most impressive of all sites. I hover over “Men’s” and click on sweaters and sweatshirts (one less click than Jabong or Flipkart). After I click on “sweatshirts” it gives me a “closure” option (this is the part that impresses me) where I can say I want a zippered front. No option to indicate hood or material, though.

In each of the above, it seems like the catalog has been thought up in a hierarchical format, with little attention paid to tagging. There might be some tags attached such as “brand” but these are tags that are available to every item. The key to tagging is that not all tags need to be applicable for all items. For example, “closure” (zippered or buttoned or open) is applicable only to sweatshirts. Sleeve length is applicable only to tops.
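As a rough sketch of what this looks like in data terms (the SKUs, tag names and values below are made up for illustration), each item keeps its single category path but also carries a free-form set of tags, and different items can carry different tags:

```python
# Hypothetical catalogue: one category path per SKU, plus whatever tags apply.
catalog = [
    {
        "sku": "SW-001",
        "category": ("men", "winter wear", "sweatshirts"),
        "tags": {"colour": "grey", "closure": "zip", "hood": "yes",
                 "material": "cotton-wool", "sleeve": "full"},
    },
    {
        "sku": "SW-002",
        "category": ("men", "winter wear", "sweatshirts"),
        "tags": {"colour": "grey", "closure": "open", "hood": "yes",
                 "material": "polyester", "sleeve": "full"},
    },
    {
        # Trousers have no "closure", "hood" or "sleeve" tag at all.
        "sku": "TR-001",
        "category": ("men", "formal", "trousers"),
        "tags": {"colour": "grey", "material": "wool"},
    },
]


def search(items, **wanted):
    """Return SKUs whose tags match every requested value.

    An item that lacks a requested tag simply does not match.
    """
    return [
        item["sku"]
        for item in items
        if all(item["tags"].get(key) == value for key, value in wanted.items())
    ]


hoodie = search(catalog, colour="grey", closure="zip", hood="yes",
                material="cotton-wool")
print(hoodie)
```

With this structure, my sweatshirt description translates directly into a query, and the shopper never has to walk down the category tree to get to it.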

In addition to search (as illustrated above), the purpose of tagging is to identify patterns in purchases and know more about customers. The basic idea is that people’s preferences could be along several axes, and at the time of segmentation and bucketing you are not sure which axis describes the person’s preferences best. So by having a large number of tags that you assign to each SKU (this sadly is a highly manual process), you give yourself a much superior chance of getting to know the customer.

In terms of technological capability, things have advanced much in terms of getting to know the customer. For example, it is now really quick to do a Market Basket Analysis based on large numbers of bills, which helps you identify patterns in purchase. With the technology bit being easy, the key to learning more about your customers is the framework you employ to “encase” the technology. And without efficient tagging, you are giving yourself a lesser chance of categorising the customer on the right axis.
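To show how little technology a basic Market Basket Analysis needs, here is a minimal sketch in plain Python. The bills are invented, and I have restricted it to pairs of items and to two standard measures, support and lift; a real exercise would use a proper library and far more bills.

```python
from collections import Counter
from itertools import combinations


def pair_rules(bills, min_support=0.2):
    """Find item pairs bought together, with their support and lift.

    support = share of bills containing both items
    lift    = support / (share containing a * share containing b)
    """
    n = len(bills)
    item_count = Counter()
    pair_count = Counter()
    for bill in bills:
        items = sorted(set(bill))  # de-duplicate and fix the pair order
        item_count.update(items)
        pair_count.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        lift = support / ((item_count[a] / n) * (item_count[b] / n))
        rules.append((a, b, support, lift))
    # Strongest associations first.
    return sorted(rules, key=lambda rule: rule[3], reverse=True)


# Hypothetical bills, one list of purchased items per bill.
bills = [
    ["bread", "butter"],
    ["bread", "butter", "jam"],
    ["bread", "milk"],
    ["milk", "eggs"],
    ["butter", "bread"],
]
rules = pair_rules(bills, min_support=0.4)
for a, b, support, lift in rules:
    print(f"{a} + {b}: support {support:.2f}, lift {lift:.2f}")
```

The same counting works just as well if each “item” is a tag rather than an SKU, which is where good tagging starts to pay off: patterns show up along axes such as “zippered” or “hooded” rather than only along individual products.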

Of course for someone used to relational databases, tagging requires non-trivial methods of storage. Firstly the number of tags varies widely by item. Secondly, tags can themselves have a hierarchy, and items might not necessarily be associated with the lowest level of tag. Thirdly, tagging is useless without efficient searching, at various levels, and it is a non-trivial technological problem to solve. But while the problems are non-trivial, the solutions are well-known and advantages large enough that whether to use tags or not is a no-brainer for an organisation that wants to use data in its decision-making.
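One well-known way of handling all three problems is an inverted index from tag to items, built so that an item tagged at a lower level is also findable under every tag above it. The sketch below is only an illustration of that idea, with an invented tag hierarchy and invented SKUs; it says nothing about how any particular retailer stores its catalogue.

```python
from collections import defaultdict

# Hypothetical tag hierarchy: child tag -> parent tag.
TAG_PARENT = {
    "cotton-wool": "blend",
    "poly-cotton": "blend",
    "blend": "material",
    "pure-cotton": "material",
}


def with_ancestors(tag):
    """Yield the tag itself and every tag above it in the hierarchy."""
    while tag is not None:
        yield tag
        tag = TAG_PARENT.get(tag)


def build_index(item_tags):
    """Build an inverted index: tag -> set of SKUs, including inherited tags."""
    index = defaultdict(set)
    for sku, tags in item_tags.items():
        for tag in tags:
            for level in with_ancestors(tag):
                index[level].add(sku)
    return index


def find(index, *tags):
    """SKUs carrying all the given tags, at any level of the hierarchy."""
    matches = [index.get(tag, set()) for tag in tags]
    return set.intersection(*matches) if matches else set()


# Items carry different numbers of tags, and not always the lowest-level one.
item_tags = {
    "SW-001": {"cotton-wool", "zip", "hood"},
    "SW-002": {"blend", "zip"},  # tagged only at the middle level
    "SW-003": {"pure-cotton", "hood"},
}
index = build_index(item_tags)
print(sorted(find(index, "blend", "zip")))
```

Expanding the ancestors at indexing time trades some storage for very cheap lookups, which seems the right trade-off when searches vastly outnumber catalogue updates.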

## So much for Nandan Nilekani’s big data campaign

I got a call a couple of hours back on my landline. The wife picked up and was asked to transfer the call to me. When she mentioned that I was busy, she was asked what we think of Nandan Nilekani and whether we are considering voting for him. She told them that we are registered to vote in Bangalore North and hence our opinion of Nilekani doesn’t matter.

I don’t know how the Nilekani campaign team got hold of our phone number. Even if they got it from some database, I don’t know how they assumed we are registered to vote in Bangalore South. For ours is a BSNL landline, and BSNL landlines in Bangalore have a definite pattern that most people in Bangalore are aware of.

Back before 2002 or so, when landline numbers in Bangalore got their eighth digit (a leading two), the leading digit of a Bangalore number gave away the broad area.

Numbers in South Bangalore started with 6. A leading 2 meant the number was from the areas dominated by government offices. A leading 5 was for MG Road and the north and east of the city (the Cantonment area, Indiranagar, Koramangala etc.) and a leading 3 meant it was a northwest Bangalore (Malleswaram to Vijayanagar) number. 8 was reserved for the outskirts.

Now, while Bangalore has expanded significantly, these patterns are broadly in place. All you need to do to know where a number is located is to look at the second digit – a 3 there still refers to the north and west sides of the city.

Among the areas of Bangalore that make up Nilekani’s constituency, the only one that has a second digit of 3 is Vijayanagar (and surrounding areas, including the Govindrajnagar constituency). From that perspective, the likelihood of a number with second digit 3 being in Nilekani’s constituency is really low. Clearly their supposed big data algorithm hasn’t picked that up!!

Forget just the second digit – look further down the number. It is public information that 2352 is one of the codes of the Rajajinagar telephone exchange, and all numbers covered by that exchange lie in either Bangalore North or Bangalore Central!!
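The kind of filter I have in mind is trivial to write. The sketch below encodes only the digit-to-area mapping described in this post, so treat it as a rough prior rather than a definitive lookup; the function names and the sample numbers are made up.

```python
# Second digit of an eight-digit Bangalore landline -> broad area,
# exactly as described in this post (broad patterns, not an official list).
AREA_BY_SECOND_DIGIT = {
    "6": "south",
    "2": "government office areas",
    "5": "MG Road, north and east",
    "3": "northwest",
    "8": "outskirts",
}


def broad_area(number: str) -> str:
    """Guess the broad area of a Bangalore landline from its second digit."""
    digits = "".join(ch for ch in number if ch.isdigit())
    # Expect eight digits with the leading 2 described above.
    if len(digits) != 8 or digits[0] != "2":
        return "unknown"
    return AREA_BY_SECOND_DIGIT.get(digits[1], "unknown")


def worth_calling(number: str, target_areas) -> bool:
    """Keep only numbers whose broad area is one the campaign cares about."""
    return broad_area(number) in target_areas


# Made-up numbers: one on a 2352 (Rajajinagar) code, one with a 6 after the 2.
print(broad_area("2352 0000"))
print(worth_calling("2660 0000", {"south"}))
```

A filter keyed on the full exchange code (such as 2352) rather than just the second digit would be finer still, and would also handle exceptions such as Vijayanagar.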

I wasn’t particularly convinced about Nilekani’s use of big data in the first place – it seemed like the usual media hype – and now I think that while his campaign team does use data, their use of it is not particularly good. It seems that the team in charge of the data analysis for Nilekani lacks any domain knowledge of the city.