Yet another “big data whisky”

A long time back I had used a primitive version of my Single Malt recommendation app to determine that I’d like Ardbeg. Presently, the wife was travelling to India from abroad, and she got me a bottle. We loved it.

And so I kept screenshots from my app on my phone at all times, to be consulted at duty-free shops, so I would know which whiskies to buy.

And then about a year back, we started planning a visit to Scotland. If you remember, we were living in London then, and my wife’s cousin and her family were going to visit us over Christmas. And the plan was to go to the Scottish Highlands for a few days. And that had to include a distillery tour.

Out came my app again, to determine which distillery to visit. I had made a scatter plot (which I have unfortunately since lost) with the distance from Inverness (where we were going to be based) on one axis, and the likelihood of my wife and me liking a whisky (based on my app) on the other (by this time, Ardbeg was firmly in the “calibration set”).
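
Since the original plot is lost, here’s a minimal sketch of how such a chart might be recreated; the distances are rough and the similarity scores are invented purely for illustration.

```python
import matplotlib.pyplot as plt

# Hypothetical numbers standing in for the lost plot -- the distances
# are approximate and the similarity scores entirely made up.
distilleries = [
    ("Clynelish", 100, 0.90),
    ("Glen Ord", 25, 0.40),
    ("Tomatin", 30, 0.45),
    ("Dalwhinnie", 90, 0.50),
]

names, distances, scores = zip(*distilleries)

fig, ax = plt.subplots()
ax.scatter(distances, scores)
for name, x, y in zip(names, distances, scores):
    ax.annotate(name, (x, y), textcoords="offset points", xytext=(5, 5))

ax.set_xlabel("Distance from Inverness (km)")
ax.set_ylabel("Likelihood of us liking it (app similarity score)")
plt.show()
```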

The clear winner was Clynelish – it was barely 100 kilometres from Inverness, promised a nice drive there, and had a very high similarity score to the stuff that we liked. I presently called them to make a booking for a distillery tour. The only problem was that it’s a Diageo distillery, and Diageo distilleries don’t allow kids inside (we were travelling with three of them).

I was proud of having planned my vacation “using data science”. I had made up a blog post in my head that I was going to write after the vacation. I was basically picturing “turning around to the umpire and shouting ‘howzzat'”. And then my hopes were dashed.

A week after I had made the booking, I got a call back from the distillery informing me that it was unfortunately going to be closed during our vacation, and so we couldn’t visit. My heart sank. We finally had to make do with two distilleries that were pretty close to Inverness, but which didn’t rate highly according to my app.

My cousin-in-law-in-law and I first visited Glen Ord, another Diageo distillery, leaving our wives and kids back in the hotel. The tour was nice, but the whisky at the distillery was rather underwhelming. The high point was the fact that Glen Ord also supplies highly peated malt to other Diageo distilleries such as Clynelish (which we couldn’t visit) and Talisker (one of my early favourites).

A day later, we went to the more family-friendly Tomatin distillery, to the south of Inverness, so we could carry my daughter along for the tour (she seemed to enjoy it; the other kids were asleep in the car with their dad). The tour seemed better there, but their flagship whisky seemed flat. And then came Cu Bocan, a highly peated whisky that they produce in very limited quantities and distribute in a limited fashion.

Initially we didn’t feel anything, but then the “smoke hit from the back”. Basically the initial taste of the whisky was smooth, but as you swallowed it, the peat would hit you. It was incredibly surreal stuff. We sat at the distillery’s bar for a while downing glasses full of Cu Bocan.

The cousin-in-law-in-law quickly bought a bottle to take back to Singapore. We dithered, reasoning we could “use Amazon to deliver it to our home in London”. The muhurta for the latter never arrived, and a few months later we were on our way to India. Travelling with six suitcases and six handbags and a kid meant that we were never going to buy duty free stuff on our way home (not that Cu Bocan was available in duty free).

In any case, Clynelish is also not widely available in duty-free shops, so we couldn’t get that either for a long time. And then we found an incredibly well-stocked duty-free shop in the Maldives, on our way back from our vacation there in August. A bottle was duly bought.

And today the auspicious event arrived for the bottle to be opened. And it’s spectacular. A very different kind of peat than Lagavulin (a bottle of which we just finished yesterday). This one hits the mouth from both the front and the back.

And I would like to call Clynelish the “new big data whisky”, having discovered it through my app, almost going there for a distillery tour, and finally tasting it a year later.

Highly recommended! And I’d highly recommend my app as well!

Cheers!

Telling stories with data

I’m about 20% through with The Verdict by Prannoy Roy and Dorab Sopariwala. It’s a fascinating book, except for one annoyance – it is full of tables that serve no purpose but to break the flow of text.

I must mention that I’m reading the book on the Kindle, which means that the tables can pose a major annoyance. Text breaks off midway through one page, and the next couple of pages involve a table or two, with several lines of text explaining what’s in the table. And then the text continues. It makes for a rather disruptive reading experience. And some of the tables have just one data point – making one wonder why it has been inserted there at all.

This is not the first book I’ve noticed making this mistake. Some of the sports analytics books I’ve read in recent times, such as The Numbers Game, make the same error (I read that in print, and still faced the same disruption). Bhagwati and Panagariya’s Why Growth Matters is similarly unreadable – tables abruptly inserted into the middle of the text, making the reader lose the flow of reading.

Telling a data story in book length is a completely different challenge to telling one in article length. And telling a story with data is a complete art form. When you’re putting a table there, you need to be able to explain why that table is important to the story – rather than putting it there just because it seems more rigorous.

Also, the exact placement of the table (something that can’t be controlled well on the Kindle, but is easy to fix in either HTML or print) matters – the table should be relevant to the piece of text immediately preceding and succeeding it, in a way that doesn’t disrupt the reader’s flow. More importantly, the table should add value at that particular point – perhaps building on something that has been described in the previous paragraph.

Book length makes it harder because people don’t normally expect tables and figures to disturb their reading flow when reading something of book length. Also, the book format means that it is not always possible to insert a table at a precise point (even in print, where pagination is an issue).

So how do you tell a book length story with data? Firstly, be very stingy about the data that you want to show – anything that doesn’t immediately add value should be banished to the appendix. Even the rigour, which academics might be particular about, can be pushed to the end notes (not footnotes, since those can be disruptive to flow as well, turning pages into half pages).

Then, once you know that showing a particular table or graph is indispensable to telling the story, put it either at the beginning or the end of a chapter. This way, it doesn’t break the reader’s flow. Then, refer to individual numbers in the middle of the text without having to put the entire table in there. Unless each and every data point in the table is important, banish it to the endnotes.

One other common mistake (I made it in my piece in Forbes published yesterday) is to put in a big table and not talk about it. It only serves to confuse the reader, who starts looking for explanations for everything in the table in later parts.

I guess authors and analysts tend to get possessive. If you have worked hard to produce insights from data, you seek to share as much of it as possible. And this can mean simply dumping all the data in the piece without regard for what the reader will do with it.

I’m making a note to myself to not repeat this mistake in future.

The (missing) Desk Quants of Main Street

A long time ago, I’d written about my experience as a Quant at an investment bank, and about how banks like mine were sitting on a pile of risk that could blow up any time soon.

There were two problems, as I had documented then. Firstly, most quants I interacted with seemed to be solving maths problems rather than finance problems, without bothering about whether their models would stand the test of the markets. Secondly, there was an element of groupthink, as quant teams were largely homogeneous and it was hard to progress while holding contrarian views.

Six years on, there has been no blowup, and in some sense banks are actually doing well (I mean, they’ve declined compared to the time just before the 2008 financial crisis but haven’t done that badly). There have been no real quant disasters (yes I know the Gaussian Copula gained infamy during the 2008 crisis, but I’m talking about a period after that crisis).

There can be many explanations for how banks have avoided quant blow-ups despite quants solving maths problems and all thinking alike, but the one I’m partial to is the presence of a “middle layer”.

Most of the quants I interacted with were “core”, in the sense that they were not attached to any sales or trading desks. Banks also typically have a large cadre of “desk quants”, who are directly associated with trading teams, and who build models and help with day-to-day risk management, pricing, etc.

Since these desk quants work closely with the business, they turn out to be much more pragmatic than the core quants – they have a good understanding of the market and use the models more as guiding principles than as rules. On the other hand, they bring the benefits of quantitative models (and work of the core quants) into day-to-day business.

Back during the financial crisis, I’d jokingly suggested that other industries should hire the quants who were now surplus to Wall Street’s requirements. Around the same time, DJ Patil et al came up with the concept of the “data scientist” and called it the “sexiest job of the 21st century”.

And so other industries started getting their own share of quants, or “data scientists” as they were now called. Nowadays it’s fashionable even for small companies, for whom data is not critical to the business, to have a data science team. Being in this profession now (I loathe calling myself a “data scientist” – I prefer to say “quant” or “analytics”), I’ve come across quite a few of those.

The problem I see with “data science” on “Main Street” (this phrase gained currency during the financial crisis as the opposite of Wall Street, in that it referred to “normal” businesses) is that it lacks the cadre of desk quants. Most data scientists are highly technical people who don’t necessarily have an understanding of the business they operate in.

Thanks to that, what I’ve noticed is that in most cases there is a chasm between the data scientists and the business, since they are unable to talk in a common language. As I’m prone to saying, this can go two ways – the business guys can either assume that the data science guys are geniuses and take their word as gospel, or they can totally disregard the data scientists as people who do some esoteric maths and don’t really understand the world. In either case, the value added is suboptimal.

It is not hard to understand why “Main Street” doesn’t have a cadre of desk quants – it’s because of the way the data science industry has evolved. Quant work at investment banks has evolved over a long period of time – the Black-Scholes equation was proposed in the early 1970s. So quants were first recruited to work directly with the traders, and core quants (at the banks that have them) were a later addition, when banks realised that some quant functions could be centralised.

On the other hand, the whole “data science” growth has been rather sudden. The volume of data, cheap incrementally-available cloud storage, easy processing and the popularity of the phrase “data science” have all grown rapidly in the last decade or so, and companies have scrambled to set up data teams. There has simply been no time to train people who get both the business and the data – and the data scientists exist as appendages that are either worshipped or ignored.

When a two-by-two ruins a scatterplot

The BBC has some very good analysis of the Brexit vote (how long back was that?), using voting data at the local authority level, and correlating it with factors such as ethnicity and educational attainment.

In terms of educational attainment, there is a really nice chart that shows the proportion of voters who voted to leave against the proportion of the population in the ward with at least a bachelor’s degree. One look at the graph tells you that the correlation is rather strong:

Source: http://www.bbc.com/news/uk-politics-38762034

And then there is the two-by-two that is superimposed on this – with regions being marked off in pink and grey. The idea of the two-by-two must have been to illustrate the correlation – to show that education is negatively correlated with the “leave” vote.

But what do we see here? A majority of the points lie in the bottom-left pink region, suggesting that wards with a lower proportion of graduates were less likely to vote leave. And this is entirely the wrong message for the graph to send.

The two-by-two would have been useful had the points in the graph been neatly divided into clusters that could be arranged in a grid. Here, though, what the scatter plot shows is a nice negatively correlated linear relationship. And by putting those pink and grey boxes, the illustration is taking attention away from that relationship.

Instead, I’d simply put the scatter plot as it is, and maybe add the line of best fit, to emphasise the negative correlation. If I wanted to be extra geeky, I might also write the R² next to the line, to show the strength of the correlation!
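
A minimal sketch of that alternative, with made-up data standing in for the BBC’s ward-level numbers:

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up ward-level data standing in for the BBC's numbers:
# x = proportion with a bachelor's degree, y = proportion voting leave.
rng = np.random.default_rng(0)
x = rng.uniform(0.10, 0.60, 100)
y = 0.80 - 0.90 * x + rng.normal(0, 0.05, 100)

# Line of best fit via least squares, and the R^2
slope, intercept = np.polyfit(x, y, 1)
r_squared = np.corrcoef(x, y)[0, 1] ** 2

fig, ax = plt.subplots()
ax.scatter(x, y, alpha=0.6)
xs = np.linspace(x.min(), x.max(), 2)
ax.plot(xs, slope * xs + intercept, color="black")
ax.annotate(f"$R^2$ = {r_squared:.2f}", xy=(0.75, 0.85), xycoords="axes fraction")
ax.set_xlabel("Proportion with a bachelor's degree")
ax.set_ylabel("Proportion voting leave")
plt.show()
```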


Financial inclusion and cash

Varad Pande and Nirat Bhatnagar have an interesting Op-Ed today in Mint about financial inclusion – about how financial institutions haven’t been innovative in making products suited to the poor, and how better user interfaces can also drive financial inclusion. I found this example they took rather interesting:

Take, for instance, a daily wager who makes Rs200 on the days she gets work. Work is unpredictable, and expenses too can be volatile, so she has to borrow money for buying vegetables, or to pay the doctor’s fees when her children fall sick. Her real need is for a flexible—small ticket, variable amount, rapid approval—loan product that she can access instantly. Unfortunately, no institutional channel—neither the public sector bank where she has a “no frills” account, nor the MFI that she has previously borrowed from—offers such a product. She ends up borrowing from neighbours, often from the local moneylender.

Now, based on my experience in FinTech, it is not hard to design a loan product for someone whose cash flows are known. The bank statement is nothing but a continuing story of the account holder’s life, and if you can understand the cash flows (both in and out) for a reasonable period of time, it is straightforward to design a loan product that fits that cash flow pattern.
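
As a minimal sketch of what I mean – assuming the statement has already been parsed into monthly net cash flows, and with an affordability rule invented purely for illustration:

```python
from statistics import mean, pstdev

# Hypothetical monthly net cash flows (inflows minus outflows), parsed
# from a bank statement. All numbers invented for illustration.
monthly_net_flows = [4200, 3800, 5100, 2900, 4600, 3500]

avg_surplus = mean(monthly_net_flows)
volatility = pstdev(monthly_net_flows)

# A made-up affordability rule: size the instalment so that even a
# "bad" month (one standard deviation below average) can cover it.
affordable_emi = max(0.0, avg_surplus - volatility)

# Maximum loan for a 12-month tenure at 2% a month, using the
# standard present-value-of-an-annuity formula.
r, n = 0.02, 12
max_loan = affordable_emi * (1 - (1 + r) ** -n) / r

print(f"Affordable EMI: {affordable_emi:.0f}; maximum loan: {max_loan:.0f}")
```

The more volatile the cash flows, the smaller the loan this rule offers – which is exactly the kind of product structuring that becomes impossible when the transactions are invisible.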

The key thing, however, is that you need to have full information on transactions, in terms of when cash comes in and goes out, what the cash outflow is used for, and all that. And that is where the cash economy is a bit of a bummer.

For a banker who is trying to underwrite, and decide the kind of loan product (and interest rate) to offer to a customer, the customer’s cash transactions obscure information; information that could’ve been used by the bank to design/structure/recommend the appropriate product for the customer.

For the case that Pande and Bhatnagar take, if all inflows and outflows are in cash, there is little beyond the potential borrower’s word that can convince bankers of the borrower’s creditworthiness. And so the potential borrower is excluded from the system.

If, on the other hand, the potential borrower were to have used non-cash means for all her transactions, bankers would have had a full picture of her life, and would have been able to give her an appropriate loan!

In this sense, I think so far financial inclusion has been going on ass-backwards, with most microfinance institutions (MFIs) targeting loans rather than deposits. And with little data to base credit on, it’s resulted in wide credit spreads and interest rates that might be seen as usurious.

Instead, if banks and MFIs had gone the other way, first getting customers to deposit, and then use the bank account for as much of their transactions as possible, it would have been possible to design much better financial products, and include more customers!

The current disruption in the cash economy possibly offers banks and MFIs a good chance to rectify their errors so far!

Intermediation and the battle for data

The Financial Times reports ($) that thanks to the rise of Alipay and WeChat’s payment system, China’s banks are losing significantly in terms of access to customer data. This is on top of the $20 billion or so they’re losing directly in terms of fees because of these intermediaries.

But when a consumer uses Alipay or WeChat for payment, banks do not receive data on the merchant’s name and location. Instead, the bank record simply shows the recipient as Alipay or WeChat.

The loss of data poses a challenge to Chinese banks at a time when their traditional lending business is under pressure from interest-rate deregulation, rising defaults, and the need to curb loan growth following the credit binge. Big data are seen as vital to lenders’ ability to expand into new business lines.

I had written earlier on my blog about how intermediaries such as Swiggy or Grofers, by offering a layer between the restaurant/shop and the consumer, now have access to the consumer’s data, which earlier resided with the retailer.

What is interesting is that before businesses realised the value of customer data, they had plenty of access to such data and were doing little to leverage and capitalise on it. And now that people are realising the value of data, new intermediaries that are coming in are capturing the data instead.

From this perspective, the Unified Payments Interface (UPI) that launched last week is a key step for Indian banks to hold on to customer data which they could have otherwise lost to payment wallet companies.

Already, some online payments are listed on my credit card statement in the name of the payment gateway rather than in the name of the merchant, denying the credit card issuers data on the customer’s spending patterns. If the UPI can truly take off as a successor to credit cards (rather than wallets), banks can continue to harness customer data.

Mike Hesson and cricket statistics

While a lot is made of the use of statistics in cricket, my broad view, based on the presentation of statistics in the media and the odd player/coach interview, is that cricket hasn’t really learnt how to use statistics as it should. A lot of so-called insights are based on small samples, and coaches such as Peter Moores have been pilloried for their excess focus on data.

In this context, I found this interview with New Zealand coach Mike Hesson on ESPNcricinfo rather interesting. From my reading of the interview, he seems to “get” data and how to use it, which helps explain how the New Zealand cricket team has generally outperformed expectations in the last few years.

Some snippets:

You’re trying to look at trends rather than chuck a whole heap of numbers at players.

For example, if you look at someone like Shikhar Dhawan, against offspin, he’s struggled. But you’ve only really got a nine or ten-ball sample – so you’ve got to make a decision on whether it’s too small to be a pattern

Also, players take a little while to develop. You’re trying to select the player for what they are now, rather than what their stats suggest over a two or three-year period.

And there are times when you have to revise your score downwards. In our first World T20 match, in Nagpur, we knew it would slow up…


Go ahead and read the whole thing.
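
As an aside, it’s easy to see why a nine or ten-ball sample is “too small to be a pattern”. A quick sketch with made-up numbers: even if a batsman had been dismissed twice in ten balls against offspin, a 95% confidence interval around that dismissal rate is enormous.

```python
from math import sqrt

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """95% Wilson score confidence interval for a binomial proportion."""
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return centre - half, centre + half

# Made-up numbers: two dismissals in ten balls against offspin.
low, high = wilson_interval(2, 10)
print(f"Observed rate: 0.20; 95% interval: ({low:.2f}, {high:.2f})")
# Prints roughly (0.06, 0.51) -- far too wide to call it a pattern.
```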

Restaurants, deliveries and data

Delivery aggregators are moving customer data away from the retailer, who now has less knowledge about his customer. 

Ever since data collection and analysis became cheap (with cloud-based on-demand web servers and MapReduce), there have been attempts to collect as much data as possible and use it to do better business. I must admit to being part of this racket, too, as I try to convince potential clients to hire me so that I can tell them what to do with their data and how.

And one of the more popular areas where people have been trying to use data is in getting to “know their customer”. This is not a particularly new exercise – supermarkets, for example, have been offering loyalty cards so that they can correlate purchases across visits and get to know you better (as part of a consulting assignment, I once sat with my clients looking at a few supermarket bills. It was incredible how much we humans could infer about the customers by looking at those bills).

The recent trend (now that it has become possible to analyse large amounts of data) is to capture “loyalties” across several stores or brands, so that affinities can be tracked across them and the customer can be understood better. Given data privacy issues, this has typically been done by third-party agents, who then sell the insights back to the companies whose data they collect. An early example of this is Payback, which links activities on your ICICI Bank account with other products (telecom providers, retailers, etc.) to gain superior insights into what you are like.

Nowadays, with cookie farming on the web, this is more common, and you have sites that track your web cookies to figure out correlations between your activities, and thus infer your lifestyle, so that better advertisements can be targeted at you.

In the last two or three years, significant investments have been made by restaurants and retailers to install devices to get to know their customers better. Traditional retailers are being fitted with point-of-sale devices (provision of these devices is a highly fragmented market). Restaurants are trying to introduce loyalty schemes (again a highly fragmented market). This is all an attempt to better get to know the customer. Except that middlemen are ruining it.

I’ve written a fair bit on middleman apps such as Grofers and Swiggy. They are basically delivery apps, which pick up goods for you from a store and deliver them to your place. A useful service, though, as I suggest in my posts linked above, probably overvalued. As more of a restaurant or store’s business goes to such intermediaries, though, there is another threat to the restaurant – the loss of customer data.

When Grofers buys my groceries from my nearby store, it is unlikely to tell the store who it is buying for. Similarly when Swiggy buys my food from a restaurant. This means the loyalty schemes of these sellers will go for a toss. Of course, not offering the same loyalty programme to delivery companies is a no-brainer. But what the sellers are also missing out on is the customer data that they would have otherwise captured (had they sold directly to the customer).

A good thing about Grofers or Swiggy is that they’ve hit the market at a time when sellers are yet to fully realise the benefits of capturing customer data, so they may be able to capture such data for cheap, and maybe sell it back to their seller clients. Yet, if you are a retailer who is selling to such aggregators and you value your customer data, make sure you get your pound of flesh from these guys.

On apps tracking you and turning you into “lab rats”

Tech2, a division of FirstPost, reports that “Facebook could be tracking all rainbow profile pictures“. In what I think is a nonsensical first paragraph, the report says:

Facebook’s News Feed experiment received a huge blow from its social media networkers. With the new rainbow coloured profile picture that celebrates equality of marriage turned us into ‘lab rats’ again? Facebook is probably tracking all those who are using its new tool to change the profile picture, believes The Atlantic.

I’m surprised things like this still make news. It is a feature (not a bug) of any good organisation that it learns from its user interactions and user behaviour, and hence tracking how users respond to certain kinds of news or updates is a fundamental part of how Facebook should behave.

And Facebook is a company that constantly improves and updates the algorithm it uses in order to decide what updates to show whom. And to do that, it needs to maintain data on who liked what, commented on what, and turned off what kind of updates. Collecting and maintaining and analysing such data is a fundamental, and critical, part of Facebook’s operations, and expecting them not to do so is downright silly (and it would be a downright silly act on part of the management if they stop experimenting or collecting data).

Whenever you sign on to an app or a service, you need to take it as a given that the app is collecting data and information from you. And that if you are not comfortable with this kind of data capture, you are better off not using the app. Of course, network effects mean that it is not that easy to live like you did in “the world until yesterday”.

This seems like yet another case of Radically Networked Outrage by outragers not having enough things to outrage about.

Getting counted

So I got counted yesterday. And my caste was also noted. This was part of the caste census currently being conducted by the state government. I had a few pertinent observations on the questions and procedure, so thought I should write them down here.

  • The survey team consisted of a man and a woman. I wonder if the gender combination of the surveyors was chosen deliberately, in order to avoid awkwardness in either direction, depending on who opens the door.
  • Anyway, the man seemed to be the senior person and didn’t speak much. The woman had an extraordinarily large “exam pad” (of A2 size if I’m not wrong), with a sheaf of papers where she would note down the answers.
  • So after “normal details” such as names and address, the survey proceeded directly to the caste. “Do I have to answer that?”, I asked. They said I didn’t have a choice. I told them and the lady noted it down. Then I was asked about my subcaste. I again asked if I should answer. The woman said yes, but the man overruled and they moved on.
  • There were other demographic questions involved, which I don’t remember answering during the “general census” four years ago. Stuff like the age at which we got married and the age at which we joined school.

    Anyone who can get their hands on the raw data can have a field day looking at the correlations. Like – what is the distribution of age of marriage by caste? (See the sketch after this list.)

  • The questionnaire was a pretty long one. A cursory search indicates there are a total of 55 questions. I was also asked about household assets. “How many TVs?” “One” “How many computers?” “Hmm.. Five”. “Laptops?” “I already counted them among computers” and so on.
  • I got asked about my family income also. Now you can see how all this data, along with caste, can be used to draw some interesting correlations.
  • And it doesn’t stop there. This is the first time in a census that I’ve been asked for an identity proof. “We want both voter ID card and Aadhaar”, the lady said. I showed our voter IDs, and she noted down the number. Remember that this is a “caste census”? You know where this might be going now. I told them I couldn’t find my Aadhaar, and they said it was okay.
  • As the lady ploughed through the multiple pages of the extra-large form, I couldn’t help wondering why the surveyors hadn’t been given tablets instead – it would have saved the repeated effort of first filling up the forms and then entering the data into a database. Given the size of the form and the difficulty of carrying it around, it would have done the surveyors a huge favour.
  • Finally, the form was filled with pencil. Ostensibly this was so that if they made any errors in entry, they could correct immediately. I’m assuming there are no “convenient alterations” made.
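
As an illustration of the kind of analysis the raw data would permit – the column names and records below are entirely hypothetical:

```python
import pandas as pd

# Hypothetical records shaped like the census responses.
df = pd.DataFrame({
    "caste": ["A", "A", "B", "B", "B", "C"],
    "age_at_marriage": [24, 27, 21, 22, 25, 29],
    "age_joined_school": [5, 6, 6, 7, 5, 4],
})

# Distribution of age at marriage, by caste
print(df.groupby("caste")["age_at_marriage"].describe())

# Correlation between age at marriage and age of joining school
print(df["age_at_marriage"].corr(df["age_joined_school"]))
```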

Thinking about it now (I didn’t think yesterday) I’ve perhaps given away more information than I should have (voter ID number, etc.), and might have compromised on my privacy. I hope, however, that I’m one of those people who gets access to the raw database once it’s compiled (obviously much much easier said than done), given the kind of data that has been collected and the insights that can be drawn from it.

If you are a politician who gets your hands on this data, and want to use that to build your election strategy, you should hire me. There is a wealth of information in this data!