SHAP and WAR

A few months back, at work, a couple of kids in my team taught me this concept called “SHAP”. I won’t go into the technical details here (or maybe I will later on in this post), but it is basically an algo that helps us explain a machine learning model.

It was one of those concepts that I found absolutely mind-blowing, to the extent that after these guys taught this concept to me, it became the proverbial hammer, and I started looking for “nails” all around the company. I’m pretty sure I’ve abused it (SHAP I mean).

Most of the documentation of SHAP is not very good, as you might expect of something that is very deeply technical. So maybe I’ll give a brief intro here. Or maybe not – it’s been a few months since I started using and abusing it, and so I’ve forgotten the maths.

In any case, this is one of those concepts that made me incredibly happy on the day I learnt about it. Basically, to put it “in brief”, what you essentially do is to zero out an explanatory variable, and see what the model predicts with the rest of the variables. The difference between this and the actual model output (averaged, in the full Shapley treatment, over many combinations of the other variables) is, approximately speaking, the contribution of this explanatory variable to this particular prediction.
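Here is a toy version of that idea in R (just an illustration of the “knock a variable out and compare” intuition, not the actual SHAP algorithm, which averages this difference over many combinations of the other variables; the model and data below are arbitrary):

# fit some model - any model will do for the illustration
library(randomForest)
set.seed(1)
model <- randomForest(mpg ~ ., data = mtcars)

obs <- mtcars[1, ]               # the prediction we want to explain
baseline <- obs
baseline$wt <- mean(mtcars$wt)   # "zero out" wt by pushing it to its average

predict(model, obs) - predict(model, baseline)
# the difference is, roughly, wt's contribution to this one prediction

The real SHAP implementations do this far more cleverly and quickly, which is what the next bit is about.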

The beauty of SHAP is that you can calculate the value for hundreds of explanatory variables and millions of observations in fairly quick time. And that’s what’s led me to use and abuse it.

In any case, I was reading something about American sport recently, and I realised that SHAP is almost exactly identical (in concept, though not in maths) to Wins Above Replacement (WAR).

WAR works the same way – a player is replaced by a hypothetical “replacement” (a readily available baseline player), and the model calculates how much the team would have won in that case. A player’s WAR is thus the difference between the “actuals” (what the team actually won) and this hypothetical in which the player is swapped out for the replacement.

This, if you think about it, is exactly analogous to zeroing out the idiosyncrasies of a particular player. So – let’s say you had a machine learning model where you had to predict wins based on certain sets of features of each player (think of the features they put on those otherwise horrible spider charts when comparing footballers).

You build this model. And then to find out the contribution of a particular player, you get rid of all of this person’s features (or replace them with the “average” across all data points). And then look at how different the prediction is from the “actual prediction”. Depending on how you look at it, it can either be SHAP or WAR.

In other words, the two concepts are pretty much exactly the same!

Bad Data Analysis

This is a post tangentially related to work, so I must point out that all views here are my own, and not views of my employer or anyone else I’m associated with.

The good thing about data analysis is that it’s inherently easy to do. The bad thing about data analysis is also that it’s inherently easy to do – with increasing data democratisation in companies, it is easier than ever to pull some data related to your hypothesis, build a few pivot tables and charts in Excel and then present your results.

Why is this a bad thing, you may ask – the reason is that it is rather easy to do bad data analysis. I never tire of replying to people who ask me “what does the data say?” with “what do you want it to say? I can make it say that”. This is not a rhetorical statement. As the old saying goes, you can “take data down into the basement and torture it until it confesses to your hypothesis”.

So, for example, when I hire analysts, I don’t check as much for the ability to pull and analyse data (those can be taught) as I do for their logical thinking skills. When they do a piece of data analysis, are they able to say whether it makes sense or not? Can they identify when the correlations the data shows are spurious? Are they taking ratios along the correct axis (e.g. “2% of Indians are below the poverty line”, versus “20% of the world’s poor are in India”)? Are they controlling for confounding variables?

This is the real skill in analytics – are you able to draw logical and sensible conclusions from what the data says? It is no coincidence that half my team at my current job has been formally trained in economics.

One of the externalities of being a head of analytics is that you come across a lot of bad data analysis – you are yourself responsible for some of it, your team is responsible for some more and given the ease of analysing data, there is a lot from everyone else as well.

And it becomes part of your job to comment on this analysis, to draw sense from it, and to say if it makes sense or not. In most cases, the analysis itself will be immaculate – well-written queries and code. The problem, almost all the time, is in the logic used.

I was reading this post by Nabeel Qureshi on puzzles. There, he quotes a book on chess puzzles, and talks about the differences between how experts and novices approach a problem.

The lesson I found the most striking is this: there’s a direct correlation between how skilled you are as a chess player, and how much time you spend falsifying your ideas. The authors find that grandmasters spend longer falsifying their idea for a move than they do coming up with the move in the first place, whereas amateur players tend to identify a solution and then play it shortly after without trying their hardest to falsify it first. (Often amateurs find reasons for playing the move — ‘hope chess’.)

Call this the ‘falsification ratio’: the ratio of time you spend trying to falsify your idea to the time you took coming up with it in the first place. For grandmasters, this is 4:1 — they’ll spend 1 minute finding the right move, and another 4 minutes trying to falsify it, whereas for amateurs this is something like 0.5:1 — 1 minute finding the move, 30 seconds making a cursory effort to falsify it.

It is the same in data analysis. If I think about the amount of time I spend in analysing data, a very, very large percentage of it (can’t put a number since I don’t track my time) goes in “falsifying it”. “Does this correlation make sense?”; “Have I taken care of all the confounding variables?”; “Does the result hold if I take a different sample or cut of data?”; “Has the data I’m using been collected properly?”; “Are there any biases in the data that might be affecting the result?”; and so on.

It is not an easy job. One small adjustment here or there, and the entire recommendation might flip. Despite being rigorous with the whole process, you can leave in some inaccuracy. And sometimes what your data shows may not conform to the biases of the counterparty (who has much better domain knowledge) – and so you have a much harder job selling it.

And once again – when someone says “we have used data, so we have been rigorous about the process”, it is all the more likely that they are wrong.

Code Density

As many of the regular readers of this blog know, I largely use R for most of my data work. There have been a few occasions when I’ve tried to use Python, but have found that I’m far less efficient in that than I am with R, and so abandoned it, despite the relative ease of putting things into production.

Now in my company, like in most companies, people use both Python and R (the team that reports to me largely uses R, everyone else largely uses Python). And while till recently I used to claim that I’m multilingual in the sense that I can read Python code fairly competently, of late I’m not sure I am. I find it increasingly difficult to parse and read production-grade Python code.

And now, after some experiments with ChatGPT, and exploring other people’s code, I have an idea of why I’m finding it hard to read production-grade Python code. It has to do with “code density”.

Of late I’ve been experimenting with Spark (finally, in this job I do a lot of “big data” work – something I never had to do in my consulting career prior to this). Related to this, I was reading someone’s PySpark code.

And then it hit me – the problem (rather, my problem) with Python is that it is far more verbose than R. The number of characters or lines of code required to do something in Python is far more than what you need in R (especially if you are using the tidyverse family of packages, which I do, including sparklyr for Spark).
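To make the point concrete, this is the kind of thing I mean – a bog-standard aggregation written the tidyverse way (a generic sketch with made-up column names and toy data, not anything from work):

library(dplyr)

# toy data, purely so the pipeline below has something to run on
orders <- tibble(
  city    = c("Bangalore", "Bangalore", "Delhi", "Delhi"),
  month   = c("Jan", "Jan", "Jan", "Feb"),
  status  = c("delivered", "returned", "delivered", "delivered"),
  revenue = c(100, 80, 120, 90),
  cost    = c(60, 50, 70, 55)
)

orders %>%
  filter(status == "delivered") %>%        # keep what we care about
  mutate(margin = revenue - cost) %>%      # derive a new column
  group_by(city, month) %>%
  summarise(n = n(), margin = sum(margin), .groups = "drop") %>%
  arrange(desc(margin))

Filter, derive, aggregate, sort – and the whole thing fits in one glance.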

Why does the density of code matter? It has to do with aesthetics, modularity and ease of understanding.

Yesterday I was writing some code that I plan to put into production. After a few hours of coding, I looked at the code and felt disgusted with myself – it was one far too long, monolithic block of code. It might have made sense while I was writing it, but I knew that if I were to revisit it in a week or two, I wouldn’t be able to understand what the hell was happening there.

I’ve never worked as a professional software engineer, but with the amount of coding I’ve done, I’ve worked out what is a “reasonable length for a code block”. It’s like that apocryphal story of Indian public examiners for high school exams who evaluate history answers based on how long they are – “if they were to place an ordinary Reynolds 045 pen vertically on the sheet, the answer should be longer than that for the student to get five marks”.

An answer in a high school history exam needs to be longer than this. A code block or function should be shorter than this

It’s the reverse here. Approximately speaking, if you were to place a Reynolds pen vertically on the screen (at your normal font size), no function (or code block) should be longer than the pen.

This roughly approximates how much the eye can see on one normal MacBook screen (I use a massive external monitor at work, and a less massive, but equally wide, one at home). If you have to keep scrolling up and down to understand the full logic, there is a higher chance you might make mistakes, and it is much harder for someone else to understand the code.

Till recently (as in earlier this week) I would crib like crazy that people coding in Python would make their code “too modular”. That I would have to keep switching back and forth between different functions (and classes!!) to make sense of some logic, and that this would make the code hard to debug (I still think there is a limit to how modular you can make your ETL code).

Now, however (I’m writing this on a Saturday – I’m not working today), from the code density perspective, it all makes sense to me.

The advantage of R is that because the code is far denser, you can pack a far greater amount of logic into a Reynolds-pen length of code. So over the years I’ve gotten used to having this much logic presented to me in one chunk (without having to scroll or switch functions).

The relatively lower density of Python, however, means that the same amount of logic that would be one function in R is now split across several different functions. It is not that the writer of the code is “making this too modular” or “writing functions just for the heck of it”. It is just that their “mental Reynolds pen” doesn’t allow them to pack more lines into a chunk or function, and Python’s lower density means there is only so much logic that can go in there.

As part of my undergrad, I did a course on Software Engineering (and the one thing the professor told us then was that we should NOT take up software engineering as a career – “it’s a boring job”, he had said). In that, one of the things we learnt was that in conventional software services contexts, billing would happen as a (nonlinear) function of “kilo lines of code” (this was in early 2003).

Now, looking back, one thing I can say is that the rate per kilo line of R code ought to be much higher than the rate per kilo line of Python code.

Cross posted on my now-largely-dormant Art of Data Science newsletter

The Second Great Wall (of programming)

Back in 2000, I entered the Computer Science undergrad program at IIT Madras thinking I was a fairly competent coder. In my high school, I had a pretty good reputation in terms of my programming skills and had built a whole bunch of games.

By the time half the course was done I had completely fallen out of love with programming, deciding a career in Computer Science was not for me. I even ignored the advice of Kama (the current diro) and went on to do an MBA.

What had happened? Basically it was a sudden increase in the steepness of the learning curve. Or that I’m a massive sucker for user experience, which the Computer Science program didn’t care for.

Back in school, my IDE of choice (rather the only one available) was TurboC, a DOS-based program. You would write your code, and then hit Ctrl+F9 to run the program. And it would just run. I didn’t have to deal with any technical issues. Looking back, we had built some fairly complex programs just using TurboC.

And then I went to IIT and found that there was no TurboC, no DOS. Most computers there had an ancient version of Unix (or worse, Solaris). These didn’t come with nice IDEs such as TurboC. Instead, you had to use vi (some of the computers were so old they didn’t even have vim) to write the code, and then compile it from outside.

Difficulties in coming to terms with vi meant that my speed of typing dropped. I couldn’t “code at the speed of thought” any more. This was the first give up moment.

Then, I discovered that C++ had now got this new “Standard Template Library” (STL) with vectors and stuff. This was very alien to the way I had learnt C++ in school. Also, I found that some of my classmates were very proficient with it, and I just couldn’t keep up. The effort seemed too much (and the general workload of the program was so high that I couldn’t get much time for “learning by myself”), so I gave up once again.

Next, I figured that a lot of my professors were suckers for graphical UIs (though they denied us good UX by denying us good computers). This, circa 2001-2, meant programming in Java and writing applets. It was a massive degree of complexity (and “boringness”) compared to the crisp C/C++ code I was used to writing. I gave up yet again.

I wasn’t done with giving up yet. Beyond all of this, there was “systems programming”. You had to write some network layers and stuff. You had to go deep into the workings of the computer system to get your code to run. This came rather intuitively to most of my engineering-minded classmates. It didn’t to me (programming in C was the “deepest” I could grok). And I gave up even more.

A decade after I graduated from IIT Madras, I visited IIM Calcutta to deliver a lecture. And got educated.

I did my B.Tech. project in “theoretical computer science”, managed to graduate and went on to do an MBA. Just before my MBA, I was helping my father with some work, and he figured I sucked at Excel. “What is the use of completing a B.Tech. in computer science if you can’t even do simple Excel stuff?”, he thundered.

In IIMB, all of us bought computers with pirated Windows and Office. I started using Excel. It was an absolute joy. It was a decade before I started using Apple products, but the UX of Windows was such a massive upgrade compared to what I’d seen in my more technical life.

In my first job (where I didn’t last long) I learnt the absolute joy of Visual Basic macros for Excel. This was another level of unlock. I did some insane gymnastics in that. I pissed off a lot of people in my second job by replicating what they thought was a complex model on an Excel sheet. In my third job, I replaced a guy on my team with an Excel macro. My programming mojo was back.

Goldman Sachs’s SLANG was even better. By the time I left there, I had learnt R as well. And then I became a “data scientist”. People asked me to use Python. I struggled with it. After the user experience of R, this was too complex. This brought back bad memories of all the systems programming and bad UX I had encountered in my undergrad. This time I was in control (I was a freelancer), so I didn’t need to give up – I was able to get all my work done in R.

The second giving up

I’ve happily used R for most of my data work in the last decade. Recently at work I started using Databricks (I still write my code in R there, using sparklyr), and I’m quite liking that as well. However, in the last 3-4 months there have been a lot of developments in “AI”, which I’ve wanted to explore.

The unfortunate thing is that most of this is available only in Python. And the bad UX problem is back again.

Initially I got excited, and managed to install Stable Diffusion on my personal Mac. I started writing some OpenAI code as well (largely using R). I started tracking developments in artificial intelligence, and trying some of them out.

And now, in the last 2-3 weeks, I’ve been struggling with “virtual environments”. Each newfangled open-source AI that is released comes with its own codebase and package requirements. They are all mutually incompatible. You install one package, and you break another package.

The “solution” to this, from what I could gather, is to use virtual environments – basically a sandbox for each of these things that I’ve downloaded. That, I find, is grossly inadequate. One of the points of using open source software is to experiment with connecting up two or more of them. And if each needs to be in its own sandbox, how is one supposed to do this? And how are all other data scientists and software engineers okay with this?

This whole virtual environment mess means that I’m giving up on programming once again. I won’t fully give up – I’ll continue to use R for most of my data work (including my job), but I’m on the verge of giving up as far as these “complex AI” tools are concerned.

It’s the UX thing all over again. I simply can’t handle bad UX. Maybe it’s my ADHD. But when something is hard to use, I simply don’t want to use it.

And so I’m giving up once again. Tagore was right.

Recreating Tufte, and Bangalore weather

For most of my life, I pretty much haven’t understood what the point of “recreating” is. For example, in school if someone says they were going to “act out ______’s _____” I would wonder what the point of it was – that story is well known so they might as well do something more creative.

Later on in life, maybe some 12-13 years back, I discovered the joy of “retelling known stories” – since everyone knows the story, you can be far more expressive in how you tell it. Still, however, just “re-creation” (not recreation) never really fascinated me. Most of the point of doing things is to do them your way, I’ve believed (and nowadays, if you think of it, most re-creating can be outsourced to a generative AI).

And then, this weekend, that changed. On Saturday, I made the long-pending trip to Blossom (it helped that my daughter had a birthday party to attend nearby), and among other things, I bought Edward Tufte’s classic “The Visual Display of Quantitative Information”. I had read a pirated PDF of this a decade ago (when I was starting out in “data science”), but always wanted the “real thing”.

And this physical copy, designed by Tufte himself, is an absolute joy to read. And I’m paying more attention to the (really beautiful) graphics. So, when I came across this chart of New York weather, I knew I had to recreate it.

A few months earlier, I had downloaded a dataset of Bangalore’s hourly temperature and rainfall since 1981 (i.e. a bit longer than my own life). This dataset ended in November 2022, but I wasn’t concerned. Basically, this is such a large and complex dataset that so far I had been unable to come up with an easy way to visualise it. So, when I saw this thing from Tufte, I decided recreating it would be a good idea.

I spent about an hour and a half yesterday doing this. I’ve ignored the colour schemes and other “aesthetic” stuff (I’ve just realised I’ve not included the right axis in my re-creation). But I do think I’ve got something fairly good.

My re-creation of Tufte’s New York weather map, in the context of Bangalore in 2022

2022 was an unusual weather year for Bangalore and it shows in this graph. May wasn’t as hot as usual, and there were some rather cold days. Bangalore recorded its coldest October and November days since the 90s (though as this graph shows, not a record by any means). It was overall a really wet year, constantly raining from May to November. The graph shows all of it.

Also, if you look at the “normal pattern” and the records, you see Bangalore’s unusual climate (yes, I do mean “climate” and not “weather” here). Thanks to the monsoons (and pre-monsoons), April is the hottest month. Summer, this year, has already started – in the afternoons it is impossible to go out now. The minimum temperatures are remarkably consistent through the year (so except early in the mornings, you pretty much NEVER need a sweater here – at least I haven’t needed one since I moved back from London).

There is so much more I can do. I’m glad to have come across a template to analyse the data with. Whenever I get the enthu (you know what this website is called) I’ll upload my code to produce this graph onto GitHub or something. And when I get more enthu, I’ll make it aesthetically similar to Tufte’s graph (and include December 2022 data as well).
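If you want a sense of what the chart involves in the meantime: it is essentially three layered ranges per day – the record band, the “normal” band, and the actual minimum-to-maximum for that day – which in ggplot2 looks roughly like this (a sketch with random placeholder numbers, not my actual code):

library(ggplot2)

# placeholder data, purely so the sketch runs - the real thing uses the hourly
# dataset aggregated to daily minima and maxima, plus the normals and records
set.seed(1)
dates <- seq(as.Date("2022-01-01"), as.Date("2022-12-31"), by = "day")
daily <- data.frame(
  date     = dates,
  rec_min  = 10, rec_max  = 38,   # record band
  norm_min = 17, norm_max = 31,   # "normal" band
  t_min    = runif(length(dates), 16, 21),
  t_max    = runif(length(dates), 26, 34)
)

ggplot(daily, aes(x = date)) +
  geom_linerange(aes(ymin = rec_min, ymax = rec_max), colour = "grey85") +
  geom_linerange(aes(ymin = norm_min, ymax = norm_max), colour = "grey60") +
  geom_linerange(aes(ymin = t_min, ymax = t_max), colour = "brown") +
  labs(x = NULL, y = "Temperature (°C)")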

 

Stable Diffusion and Chat GPT and Logistic Regression

For a long time I have had this shibboleth for telling whether someone is a “statistics person” or a “machine learning person”. It is based on what they call regressions where the dependent variable is binary. Statisticians simply call it “logit” (there is also a “probit”); machine learning people call it “logistic regression”.

Now, in terms of implementation as well, there is one big difference between the way “logit” is modelled versus “logistic regression”. For a logit model (if you are using Python, you need to use the “statsmodels” package for this, not scikit-learn), the number of observations needs to far exceed the number of independent variables.

Else, a matrix that needs to be inverted as part of the solution will turn out to be singular, and there will be no solution. I guess I betrayed my greater background in statistics than in Machine Learning when, in 2018, I wrote this blogpost on machine learning being a “process to tie down coefficients in maths models”.

“Logistic regression” (as opposed to “logit”), on the other hand, puts no such constraint on the design matrix being invertible. Instead of actually inverting the matrix, machine learning approaches simply learn the coefficients directly using gradient descent (basically hill climbing, in reverse), so mathematical inconveniences such as matrices that cannot be inverted are moot there.
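A quick way to see this difference in R (my analogue of the statsmodels-versus-scikit-learn distinction; simulated data purely for illustration, and it assumes the glmnet package is installed):

set.seed(1)
n <- 50; p <- 200
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(X[, 1] - X[, 2]))

# classical logit: with more variables than observations the design matrix is
# rank-deficient, so glm() cannot estimate most coefficients and returns NA
fit_logit <- glm(y ~ X, family = binomial())
sum(is.na(coef(fit_logit)))      # lots of NAs (and R grumbles while fitting)

# an ML-style fit (iterative, penalised) has no invertibility problem at all
library(glmnet)
fit_ml <- glmnet(X, y, family = "binomial")   # happily returns a coefficient path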

And so you have logistic regression models with thousands of variables, often calibrated with fewer data points than that. To be honest, I can’t understand this fully – without sufficient information (data points) to calibrate the coefficients, there will always be a sense of randomness in the output. The model has too many degrees of freedom, and so there is additional information the model is supplying (apart from what was supplied in the training data!).

Of late I have been playing a fair bit with generative AI (primarily ChatGPT and Stable Diffusion). The other day, my daughter and I were alone in my in-laws’ house, and I told her “look, I’ve brought my personal laptop along, if you want we can play with it”. And she demanded that she “play with Stable Diffusion”. This is the image she got for “tiger chasing deer”.

I have written earlier here about how the likes of ChatGPT and Stable Diffusion in a way redefine “information content”.

 

And if you think about it, almost by definition, “generative AI” creates information (and hallucinates, like in the above pic). Traditionally speaking, a “picture is worth a thousand words”, but if you can generate a picture with just a few words of prompt, the information content in it is far less than a thousand words.

In some sense, this reminds me of “logistic regression” once again. By definition (because it is generative), there is insufficient “tying down of coefficients”, because of which the AI inevitably ends up “adding value of its own”, which by definition is random.

So, you will end up getting arbitrary results. ChatGPT often gives you wrong answers to questions. DALL-E and Midjourney and Stable Diffusion will return nonsense images such as the above. Because a “generative AI” needs to create information, by definition, all the coefficients of the model cannot be well calibrated.

And the consequence of this is that however good these AIs get, however much data is used to train them, there will always be an element of randomness to them. There will always be test cases where they give funny results.

No, AGI is not here yet.

More on CRM

On Friday afternoon, I got a call on my phone. It was a “+91 9818… ” number, and my first instinct was that it was someone at work (my company is headquartered in Gurgaon), and I mentally prepared a “don’t you know I’m on vacation? can you call me on Monday instead” as I picked up the call.

It turned out to be Baninder Singh, founder of Savorworks Coffee. I had placed an order on his website on Thursday, and I half expected him to tell me that some of the things I had ordered were out of stock.

“Karthik, for your order of the Piñanas, you have asked for an Aeropress grind. Are you sure of this? I’m asking you because you usually order whole beans”, Baninder said. This was a remarkably pertinent observation, and an appropriate question from a seller. I confirmed to him that this was indeed deliberate (this smaller package is to take to office along with my Aeropress Go), and thanked him for asking. He went on to point out that one of the other coffees I had ordered had very limited stocks, and I should consider stocking up on it.

Some people might find this creepy (that the seller knows exactly what you order, and notices changes in your order), but from a more conventional retail perspective, this is brilliant. It is great that the seller has accurate information on your profile, and is able to detect any anomalies and alert you before something goes wrong.

Now, Savorworks is a small business (a Delhi based independent roastery), and having ordered from them at least a dozen times, I guess I’m one of their more regular customers. So it’s easy for them to keep track and take care of me.

It is similar with small “mom-and-pop” stores. A limited and high-repeat clientele means it’s easy for them to keep track of their customers and look after them. The challenge, though, is: how do you scale it? Now, I’m by no means the only person thinking about this problem. Thousands of business people and data scientists and retailers and technology people and what not have pondered this question for over a decade now. Yet, what you find is that at scale you are simply unable to provide the sort of service you can at small scale.

In theory it should be possible for an AI to profile customers based on their purchases, adds to carts, etc. and then provide them customised experiences. I’m sure tonnes of companies are already trying to do this. However, based on my experience I don’t think anyone is doing this well.

I might sound like a broken record here, but my sense is that this is because the people who are building the algos are not the ones who are thinking of solving the business problems. The algos exist. In theory, if I look at stuff like Stable Diffusion or ChatGPT (both of which I’ve been playing around with extensively in the last 2 days), algorithms for stuff like customer profiling shouldn’t be THAT hard. The issue, I suspect, is that people have not been asking the right questions of the algos.

On the one hand, you could have business people looking at patterns they have divined themselves and then giving precise instructions to the data scientists on how to detect them – and the detection of these patterns would have been hard-coded. On the other, the data scientists would have had a free hand and would have done some unsupervised stuff without much business context. And both approaches lead to easily predictable algos that aren’t particularly intelligent.

Now I’m thinking of this as a “dollar bill on the road” kind of a problem. My instinct tells me that “solution exists”, but my other instinct tells that “if a solution existed someone would have found it given how many companies are working on this kind of thing for so long”.

The other issue with such algos is that the deeper you go into prediction, the harder it gets. At the cohort level (hundreds of users), it should not be hard to profile. However, at the individual user level (at which the results of the algos are seen by customers) it is much harder to get right. So maybe there are good solutions but we haven’t yet seen them.

Maybe at some point in the near future, I’ll take another stab at solving this kind of problem. Until then, you have human intelligence and random algos.

 

Alcohol, dinner time and sleep

A couple of months back, I presented what I now realise is a piece of bad data analysis. At the outset, there is nothing special about this – I present bad data analysis all the time at work. In fact, I may even argue that as a head of Data Science and BI, I’m entitled to do this. Anyway, this is not about work.

In that piece, I had looked at some of the data I’ve been diligently collecting about myself for over a year, correlated it with the data collected through my Apple Watch, and found that on days I drank alcohol, my average sleeping heart rate was higher.

And so I had concluded that alcohol is bad for me. Then again, I’m an experimenter so I didn’t let that stop me from having alcohol altogether. In fact, if I look at my data, the frequency of having alcohol actually went up after my previous blog post, though for a very different reason.

However, having written this blog post, every time I drank, I would check my sleeping heart rate the next day. Most days it seemed “normal”. No spike due to the alcohol. I decided it merited more investigation – which I finished yesterday.

First, the anecdotal evidence – what kind of alcohol I have matters. Wine and scotch have very little impact on my sleep or heart rate (last year with my Ultrahuman patch I’d figured that they had very little impact on blood sugar as well). Beer, on the other hand, has a significant (negative) impact on heart rate (I normally don’t drink anything else).

Unfortunately, I don’t capture this data point (what kind of alcohol I drank, or how much) in my daily log. So it is impossible to analyse it scientifically.

Anecdotally I started noticing another thing – all the big spikes I had reported in my previous blogpost on the topic were on days when I kept drinking (usually with others) and then had dinner very late. Could late dinner be the cause of my elevated heart rate? Again, in the days after my previous blogpost, I would notice that late dinners would lead to elevated sleeping heart rates  (even if I hadn’t had alcohol that day). Looking at my nightly heart rate graph, I could see that the heart rate on these days would be elevated in the early part of my sleep.

The good news is this (dinner time) is a data point I regularly capture. So when I finally got down to revisiting the analysis yesterday, I had a LOT of data to work with. I won’t go into the intricacies of the analysis (and all the negative results) here. But here are the key insights.

If I regress my resting heart rate against the binary of whether I had alcohol the previous day, I get a significant regression, with an R^2 of 6.1% (i.e. whether I had alcohol the previous day or not explains 6.1% of the variance in my sleeping heart rate). If I have had alcohol the previous day, my sleeping heart rate is higher by about 2 beats per minute on average.

Call:
lm(formula = HR ~ Alcohol, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.6523 -2.6349 -0.3849  2.0314 17.5477 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  69.4849     0.3843 180.793  < 2e-16 ***
AlcoholYes    2.1674     0.6234   3.477 0.000645 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.957 on 169 degrees of freedom
Multiple R-squared:  0.06676,   Adjusted R-squared:  0.06123 
F-statistic: 12.09 on 1 and 169 DF,  p-value: 0.000645

Then I regressed my resting heart rate on dinner time (expressed in hours) alone. Again a significant regression, but with a much higher R^2 of 9.7%. So what time I have dinner explains a lot more of the variance in my resting heart rate than whether I’ve had alcohol. And for each hour later that I have my dinner, my sleeping heart rate that night goes up by about 0.8 bpm.

Call:
lm(formula = HR ~ Dinner, data = .)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6047 -2.4551 -0.0042  2.0453 16.7891 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  54.7719     3.5540  15.411  < 2e-16 ***
Dinner        0.8018     0.1828   4.387 2.02e-05 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.881 on 169 degrees of freedom
Multiple R-squared:  0.1022,    Adjusted R-squared:  0.09693 
F-statistic: 19.25 on 1 and 169 DF,  p-value: 2.017e-05

Finally, for the sake of completeness, I regressed on both. The interesting thing is that the adjusted R^2 pretty much added up – giving me > 16% now (so effectively the two (dinner time and alcohol) are uncorrelated). The coefficients are pretty much the same once again.
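The combined model, for reference, is presumably just the two-term version of the earlier calls, with both coefficients close to their single-variable estimates:

lm(formula = HR ~ Alcohol + Dinner, data = .)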


So the takeaway is simple – alcohol might be okay, but have dinner at my regular time (~ 6pm). Also – if I’m going out drinking, I better finish my dinner and go. And no – having beer won’t work – it is going to be another dinner in itself. So stick to wine or scotch.

I must mention the things I analysed against and didn’t find significant – whether I have coffee, what time I sleep, the time gap between dinner time and sleep time – none of these has any impact on my resting heart rate. All that matters is alcohol and when I have dinner.

And the last one is something I should never compromise on.


Mo Salah and Machine Learning

First of all, I’m damn happy that Mo Salah has renewed his Liverpool contract. With Sadio Mane also leaving, the attack was looking a bit thin (I was distinctly unhappy with the Jota-Mane-Diaz forward line we used in the Champions League final. Lacked cohesion). Nunez is still untested in terms of “leadership”, and without Salah that would’ve left Firmino as the only “attacking leader”.

(non-technical readers can skip the section in italics and still make sense of this post)

Now that this is out of the way, I’m interested in seeing one statistic (for which I’m pretty sure I don’t have the data). For each of the chances that have fallen to Salah, I want to look at the xG (expected goals) and whether he scored or not. And then look at a density plot of xG for both categories (scored or not).

For most players, this is likely to result in two very distinct curves – they are likely to score from a large % of high xG chances, and almost not score at all from low xG chances. For Salah, though, the two density curves are likely to be a lot closer.
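To make that concrete, the plot I have in mind is something like this (toy, simulated numbers – as I said, I don’t have the real xG data):

library(ggplot2)

set.seed(11)
# made-up chances: for a "typical" striker, scoring probability tracks xG closely
chances <- data.frame(xG = runif(500, 0.02, 0.9))
chances$scored <- ifelse(rbinom(500, 1, chances$xG) == 1, "scored", "missed")

ggplot(chances, aes(x = xG, fill = scored)) +
  geom_density(alpha = 0.5) +
  labs(title = "xG of chances, split by whether they were converted")

For the typical striker simulated above, the two curves separate cleanly; the claim is that for Salah they would sit much closer together.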

What I’m saying is – most strikers score well from easy chances, and fail to score from difficult chances. Salah is not like that. On the one hand, he creates and scores some extraordinary goals out of nothing (low xG). On the other, he tends to miss a lot of seemingly easy chances (high xG).

In fact, it is quite possible to look at a player like Salah, see a few sitters that he has missed (he misses quite a few of them), and think he is a poor forward. And if you look at a small sample of data (or short periods of time) you are likely to come to the same conclusion. Look at the last 3-4 months of the 2021-22 season. The consensus among pundits then was that Salah had become poor (and on Reddit, you could see Liverpool fans arguing that we shouldn’t give him a lucrative contract extension since ‘he has lost it’).

It is well possible that this is exactly the conclusion Jose Mourinho came to back in 2013-14 when he managed Salah at Chelsea (and gave him very few opportunities). The thing with a player like Salah is that he is so unpredictable that it is very possible to see samples and think he is useless.

Of late, I’ve been doing (rather, supervising – no pun intended) a lot of machine learning work. A lot of this has to do with binary classification – classifying something as either a 0 or a 1. Data scientists build models, which give out a probability score that the thing is a 1, and then use some (sometimes arbitrary) cutoff to determine whether the thing is a 0 or a 1.

There are a bunch of metrics in data science on how good a model is, and it all comes down to what the model predicted and what “really” happened. And I’ve seen data scientists work super hard to improve on these accuracy measures. What can be done to predict a little bit better? Why is this model only giving me 77% ROC-AUC when for the other problem I was able to get 90%?

The thing is – if the variable you are trying to predict is something like whether Salah will score from a particular chance, your accuracy metric will be really low indeed. Because he is fundamentally unpredictable. It is the same with some of the machine learning stuff – a lot of models are trying to predict something that is fundamentally unpredictable, so there is a limit on how accurate the model will get.
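A small simulation illustrates this ceiling (plain R, with a hand-rolled rank-based AUC so nothing beyond base R is needed):

# Mann-Whitney / rank-based AUC
auc <- function(score, y) {
  r  <- rank(score)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(1)
n <- 10000
x <- rnorm(n)             # everything observable about the "chance"
p <- plogis(0.5 * x)      # true scoring probability: weak signal, lots of noise
y <- rbinom(n, 1, p)      # what actually happened

auc(p, y)                 # modest, nowhere near 1 - and this uses the TRUE probabilities

Even the model that knows each case’s true probability tops out at a modest AUC here, because most of the outcome is noise – which is exactly the Salah situation.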

The problem is that you would have come across several problem statements that are much more predictable, and so you think it is a problem with you (or your model) that you can’t predict better. Pundits (or Jose) would have seen so many strikers who predictably score from good chances that they think Salah is not good.

The solution in these cases is to look at aggregates. Looking at each single prediction will not take us anywhere. Instead, can we check whether, over a large set of data, we broadly got it right? In my “research” for this blogpost, I found this.

Last season, on average, Salah scored precisely as many goals as the model would’ve predicted! You might remember stunners like the one against Manchester City at Anfield. So you know where things got averaged out.

So many numbers! Must be very complicated!

The story dates back to 2007. Retrofitting the term, I was in what can be described as my first ever “data science job”. After having struggled for several months to string together a forecasting model in Java (the bugs kept multiplying and cascading), I’d given up and gone back to the familiarity of MS Excel and VBA (remember that this was just about a year after I’d finished my MBA).

My seat in the office was near a door that led to the balcony, where smokers would gather. People walking to the balcony could, with some effort, see my screen. No doubt most of them would’ve seen me spending 90% (or more) of my time on Google Talk (it’s ironic that I now largely use Google Chat for work). If someone came at an auspicious time, though, they would see me really working, which meant using MS Excel.

I distinctly remember this one time this guy who shared my office cab walked up behind me. I had a full sheet of Excel data and was trying to make sense of it. He took one look at my screen and exclaimed, “oh, so many numbers! Must be very complicated!” (FWIW, he was a software engineer). I gave him a fairly dirty look, wondering what was complicated about a fairly simple dataset on Excel. He moved on, to the balcony. I moved on, with my analysis.

It is funny that, fifteen years down the line, I have built my career in data science. Yet, I just can’t make sense of large sets of numbers. If someone sends me a sheet full of numbers, I can’t make head or tail of it. Maybe I’m a victim of my own obsessions, where I spend hours visualising data so I can make some sense of it – I just can’t understand matrices of numbers thrown together.

At the very least, I need the numbers formatted well (in an Excel context, using either the “,” or “%” formats), with all numbers in a single column right aligned and rounded off to the exact same number of decimal places (it annoys me that by default, Excel autocorrects “84.0” (for example) to “84” – that disturbs this formatting. Applying “,” fixes it, though). Sometimes I demand that conditional formatting be applied on the numbers, so I know which numbers stand out (again I have a strong preference for red-white-green (or green-white-red, depending upon whether the quantity is “good” or “bad”) formatting). I might even demand sparklines.

But send me a sheet full of numbers without any of the above-mentioned decorations, and I’m completely unable to make any sense of it or draw any insight out of it. I fully empathise now with the guy who said “oh, so many numbers! must be very complicated!”

And I’m supposed to be a data scientist. In any case, I’d written a long time back about why data scientists ought to be good at Excel.