I don’t know which 80%

Legendary retailer John Wanamaker (who pioneered fixed-price stores in the mid-1800s) is supposed to have said that “half of all advertising is useless; the trouble is I don’t know which half”.

I was playing around with my Twitter archive data, and was looking at the distribution of retweets and favourites across all my tweets. To say that it follows a power law is an understatement.

Before this blog post triggers an automated tweet, I have 63,793 tweets, of which 59,275 (93%) have not had a single retweet. 51,717 (81%) have not had a single person liking them. And 50,165 (79%) of all my tweets have not had a single retweet or favourite.

In other words, nearly 80% of all my tweets had absolutely no impact on the world. They might as well have not existed. Which means that I should cut my time spent tweeting down to a fifth. Just that, to paraphrase Wanamaker, I don’t know which four-fifths I should eliminate!

There is some good news, though. Over time, the proportion of my tweets that have had no impact (in terms of retweets or favourites – the Twitter dump doesn’t give me the number of replies to a tweet) has been falling consistently.

Right now, this month, the score is around 33% or so. So even though the proportion of my useless tweets has been dropping over time, even now one in every three tweets that I tweet has zero impact.

My “most impactful tweet” itself accounts for 17% of all retweets that I’ve got. Here I look at what proportion of tweets has accounted for what proportion of “reactions” (the reactions for a tweet are defined as the sum of its retweets and favourites; I understand that the same person might have both retweeted and favourited something, but I ignore that bit for now).

Notice how extreme the graph is. 0.7% of all my tweets have accounted for 50% of all retweets and likes! 10% of all my tweets have accounted for 90% of all retweets and likes.

Even if I look only at recent data, it doesn’t change shape that much – starting from January 2019, 0.8% of my tweets have accounted for 50% of all retweets and likes.
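This “what proportion accounts for what proportion” number is easy to compute from the archive. A minimal sketch in Python, with made-up reaction counts standing in for the real dump:

```python
# Smallest fraction of tweets that accounts for a target fraction of all
# "reactions" (retweets + favourites). The counts below are invented,
# power-law-ish stand-ins for a real Twitter archive dump.

def tweet_share_for_reaction_share(reactions, target=0.5):
    total = sum(reactions)
    running = 0
    # Biggest tweets first; accumulate until we cross the target share
    for i, r in enumerate(sorted(reactions, reverse=True), start=1):
        running += r
        if running >= target * total:
            return i / len(reactions)
    return 1.0

counts = [500, 40, 20, 10, 5, 5] + [1] * 14 + [0] * 80  # 100 "tweets"

print(tweet_share_for_reaction_share(counts, 0.5))  # -> 0.01
print(tweet_share_for_reaction_share(counts, 0.9))  # -> 0.02
```

Even in this toy archive, one tweet in a hundred carries half of all the reactions.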

This, I guess, is the fundamental nature of social media. The impact of a particular tweet follows a power law with a very small exponent (meaning highly unequal).

What this also means is that anyone can go viral. Anyone can go from zero to hero in a single day. It is very hard to predict who is going to be a social media sensation some day.

So it’s okay that 80% of my tweets have no traction. I got one blockbuster, and who knows – I might have another some day. I guess such blockbusters are what we live for.

The World After Overbooking

Why do you think you usually have to wait so long to see a doctor, even when you have an appointment? It is because doctors routinely overbook.

You can think of a doctor’s appointment as a free option. You call up, give your patient number, and are assigned a slot when the doctor will see you. If you choose to see the doctor at that time, you get the doctor’s services, and then pay for them. If you choose not to turn up, the doctor’s time in that slot is essentially wasted, since there is nobody else to see then. The doctor doesn’t get compensated for this either.

Thus, in order not to waste their time, doctors routinely overbook patients. If the average patient takes fifteen minutes to see, they give appointments once every ten minutes, in the hope of building up a buffer so that their time is not wasted. This way they protect their incomes, and customers pay for it in terms of long waits.
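A toy simulation shows why the overbooking works for the doctor and not for the patients. All the numbers here – the fifteen-minute consultation, the 20% no-show rate – are assumptions for illustration:

```python
import random

# Single-doctor queue: appointments every `interval` minutes, fixed
# `service`-minute consultations, and each patient skipping the
# appointment (the unexercised free option) with probability `no_show`.

def average_wait(interval, service=15, no_show=0.2, patients=200, trials=200, seed=42):
    rng = random.Random(seed)
    waits = []
    for _ in range(trials):
        free_at = 0.0  # when the doctor next becomes free
        for i in range(patients):
            if rng.random() < no_show:
                continue  # slot wasted; nobody else to see
            arrival = i * interval
            start = max(arrival, free_at)
            waits.append(start - arrival)
            free_at = start + service
    return sum(waits) / len(waits)

# Overbooked ten-minute slots vs honest fifteen-minute slots
print(average_wait(interval=10), average_wait(interval=15))
```

With a 20% no-show rate, ten-minute slots keep the doctor busy but the waits balloon; fifteen-minute slots mean no waiting at all, but every no-show is fifteen minutes of the doctor’s time down the drain.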

Now, in the aftermath of the covid crisis, this will need to change. People won’t want to spend long hours in a closed waiting room with scores of other sick people. In an ideal world, doctors will want to not let two of their patients even see each other, since that could mean increased disease transmission.

In the inimitable words of Ravi Shastri, “something’s got to give”.

One way could be for doctors to simply raise their fees and give out appointments at intervals that better reflect the time taken per patient. The problem with this is that there are reputation costs to raising the fee per patient, and doctors simply aren’t conditioned to unexpected breaks between patients. Moreover, fewer slots might mean appointments not being available for several days together, and higher cancellations as well – both problems that doctors want to avoid.

As someone with a background in financial derivatives, I see one obvious thing to tackle – the free option being given to patients in the form of the appointment. What if you were to charge people for making appointments?

Now, taking credit card details at the time of booking is not efficient. However, assuming that most patients a doctor sees are “repeat patients”, just keeping track of who didn’t turn up for appointments can be used to charge them extra on the next visit (this needs to have been made clear in advance, at the time of making the appointment).

My take is that even if this appointment booking cost is trivial (say 5% of the session fee), people are bound to take the appointments more seriously. And when people take their appointments more seriously, the amount of buffer built in by doctors in their schedules can be reduced. Which means they can give out appointments at more realistic intervals. Which also means their income overall is protected, while still maintaining social distancing among patients.

I remember modelling this way back when I was working in air cargo pricing. There again, free options abound. I remember building a model that showed that charging a nominal fee for the options could result in a much lower fee for the actual cargo. A sort of win-win for customers and airlines alike. Needless to say, I was the only ex-derivatives guy around, and it proved to be a really hard sell everywhere.

However, the concept remains. When options that have hitherto been free get monetised, it will lead to a win-win situation and significantly superior experience for all parties involved. The only caveat is that the option pricing should be implemented in a manner with as little friction as possible, else transaction costs can overwhelm the efficiency gains.

More on covid testing

There has been a massive jump in the number of covid-19 positive cases in Karnataka over the last couple of days. Today, there were 44 new cases discovered, and yesterday there were 36. This is a big jump from the average of about 15 cases per day in the preceding 4-5 days.

The good news is that not all of this is new infection. A lot of cases that have come out today are clusters of people who have collectively tested positive. However, there is one bit from yesterday’s cases (again a bunch of clusters) that stands out.

Source: covid19india.org

I guess by now everyone knows what “travelled from Delhi” is a euphemism for. The reason they are interesting to me is that they are based on a “repeat test”. In other words, all these people had tested negative the first time they were tested, and then they were tested again yesterday and found positive.

Why did they need a repeat test? That’s because the sensitivity of the Covid-19 test is rather low. Out of every 100 infected people who take the test, only about 70 are found positive (on average) by the test. That also depends upon when the sample is taken.  From the abstract of this paper:

Over the four days of infection prior to the typical time of symptom onset (day 5) the probability of a false negative test in an infected individual falls from 100% on day one (95% CI 69-100%) to 61% on day four (95% CI 18-98%), though there is considerable uncertainty in these numbers. On the day of symptom onset, the median false negative rate was 39% (95% CI 16-77%). This decreased to 26% (95% CI 18-34%) on day 8 (3 days after symptom onset), then began to rise again, from 27% (95% CI 20-34%) on day 9 to 61% (95% CI 54-67%) on day 21.

In other words, about one in three infected people (depending upon when you draw the sample) is found by the test to be uninfected. Maybe I should state it again: if you test a Covid-19 positive person for Covid-19, there is almost a one-third chance that she will be found negative.

The good news (on the face of it) is that the test has “high specificity” of about 97-98% (this is from conversations I’ve had with people in the know; I’m unable to find links to corroborate it), or a false positive rate of 2-3%. That seems rather accurate, except that when the “prior probability” of having the disease is low, even this specificity is not good enough.

Let’s assume that a million Indians are covid-19 positive (the official numbers as of today are a little more than one-hundredth of that number). With one and a third billion people, that represents 0.075% of the population.

Let’s say we were to start “random testing” (as a number of commentators are advocating), and were to pull a random person off the street to test for Covid-19. The “prior” (before testing) likelihood she has Covid-19 is 0.075% (assume we don’t know anything more about her to change this assumption).

If we were to take 20,000 such people, 15 of them will have the disease. The other 19,985 don’t. Let’s test all 20,000 of them.

Of the 15 who have the disease, the test returns “positive” for 10.5 (70% sensitivity; round up to 11). Of the 19,985 who don’t have the disease, the test returns “positive” for 400 of them (let’s assume a specificity of 98%, or a false positive rate of 2%, placing more faith in the test)! In other words, if there were a million Covid-19 positive people in India, and a random Indian were to take the test and test positive, the likelihood she actually has the disease is 11/411 = 2.6%.

If there were 10 million Covid-19 positive people in India (no harm in supposing), then the “base rate” would be 0.75%. So out of our sample of 20,000, 150 would have the disease. Again testing all 20,000, 105 of the 150 who have the disease would test positive, and 397 of the 19,850 who don’t have the disease would also test positive. In other words, if there were ten million Covid-19 positive people in India, and a random Indian were to take the test and test positive, the likelihood she actually has the disease is 105/(397+105) = 21%.
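Both of these numbers are just Bayes’ rule, and the arithmetic above can be packaged into a two-line function (using the sensitivity and specificity assumed earlier):

```python
# P(infected | positive test) by Bayes' rule, with the sensitivity (70%)
# and specificity (98%) figures assumed in the text.

def p_infected_given_positive(prevalence, sensitivity=0.7, specificity=0.98):
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

print(p_infected_given_positive(0.00075))  # ~1 million infected -> ~2.6%
print(p_infected_given_positive(0.0075))   # ~10 million infected -> ~21%
```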

If there were ten million Covid-19 positive people in India, only one-fifth of the people who tested positive in a random test would actually have the disease.

Take a sip of water (ok I’m reading The Ken’s Beyond The First Order too much nowadays, it seems).

This is all standard maths stuff, and any self-respecting book or course on probability and Bayes’s Theorem will have at least a reference to AIDS or cancer testing. The story goes that this was a big deal in the 1990s when some people suggested that the AIDS test be used widely. Then, once this problem of false positives and posterior probabilities was pointed out, the strategy of only testing “high risk cases” got accepted.

And with a “low incidence” disease like covid-19, effective testing means you test people with a high prior probability. In India, that has meant testing people who travelled abroad, people who have come in contact with other known infected, healthcare workers, people who attended the Tablighi Jamaat conference in Delhi, and so on.

The advantage with testing people who already have a reasonable chance of having the disease is that once the test returns positive, you can be pretty sure they actually have the disease. It is more effective and efficient. Testing people with a “high prior probability of disease” is not discriminatory, or a “sampling bias” as some commentators alleged. It is prudent statistical practice.

Again, as I found to my own detriment with my tweetstorm on this topic the other day, people are bound to see politics and ascribe political motives to everything nowadays. In that sense, a lot of the commentary is not surprising. It’s also not surprising that when “one wing” heavily retweeted my article, “the other wing” made efforts to find holes in my argument (which, again, is textbook math).

One possibly apolitical criticism of my tweetstorm was that “the purpose of random testing is not to find out who is positive. It is to find out what proportion of the population has the disease”. The costs of this (apart from the monetary cost of actually testing) are threefold. Firstly, a large number of uninfected people will get hospitalised in covid-specific hospitals, clogging hospital capacity and increasing the chances that they get infected while in hospital.

Secondly, getting a truly random sample in this case is tricky, and possibly unethical. When you have limited testing capacity, you would be inclined (possibly morally, even) to use it on people who already have a high prior probability.

Finally, when the incidence is small, we need a really large sample to narrow down the range in which the true incidence lies.

Let’s say 1 in 1000 Indians have the disease (or about 1.35 million people). Using the chi-square test of proportions, our estimate of the incidence of the disease depends significantly on how many people are tested.

If we test 1000 people and find 1 positive, the true incidence of the disease (95% confidence interval) could be anywhere from 0.01% to 0.65%.

If we test 10000 people and find 10 positive, the true incidence of the disease could be anywhere between 0.05% and 0.2%.

Only if we test 100,000 people (a truly massive random sample) and find 100 positive does the true incidence lie between 0.08% and 0.12% – an acceptable range.
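For illustration, here is one standard way of putting such a range around an observed proportion – the Wilson score interval (the endpoints won’t match my numbers above to the last decimal, since those came from a slightly different method):

```python
import math

# 95% Wilson score interval for a binomial proportion: how precisely a
# random sample of size n pins down a true incidence of ~0.1%.

def wilson_interval(positives, n, z=1.96):
    p = positives / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

for n in (1_000, 10_000, 100_000):
    lo, hi = wilson_interval(n // 1_000, n)  # observe 0.1% positive
    print(f"n={n}: {lo:.3%} to {hi:.3%}")
```

The interval shrinks roughly with the square root of the sample size, which is why a hundred-fold increase in testing is needed to go from a useless range to an acceptable one.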

I admit that we may not be testing enough. A simple rule of thumb is that anyone with more than a 5% prior probability of having the disease needs to be tested. How we determine this prior probability is again dependent on some rules of thumb.

I’ll close by saying that we should NOT be doing random testing. That would be unethical on multiple counts.

Arzoos

Founders, once they have a successful exit, tend to treat themselves as Gods.

Investors bow to them, and possibly recruit them into their investment teams. Startups flock to them, in the hope that they might use their recently gained wealth to invest in these companies. Having produced one successful exit, people assume that these people have “cracked the startup game”.

And so even if they have started humbly after their exit, all this adulation, and the perceived power to make or break a company by pulling out their chequebooks, goes to their heads, and the successfully exited founders start treating themselves as Gods. And they believe that their one successful exit, which might have come about for whatever reason (including a healthy dose of luck), makes them an authority to speak on pretty much any topic under the sun.

Now, I’m not grudging them their money. There would have been something in the companies that they built – including timing, or even luck – that makes these people deserving of all the money they’ve made. What irritates me is their attitude of “knowing the mantra for success”, which they believe allows them to comment on pretty much any issue or company, thinking people will take them seriously.

Recently I’ve come up with a word to represent all these one-time-successful founders who then flounder while dispensing advice – “Arzoos”.

The name of course alludes to Arzoo.com, which Sabeer Bhatia started after selling Hotmail to Microsoft. He had made a massive exit, and was one of the poster children of the dotcom boom (before the bust), especially in his native India. Except that the next company he started (Arzoo) sank without a trace to the extent that nobody even knows (or remembers) what the company did.

There is a huge dose of luck involved in making a small company successful, and that someone had a good exit doesn’t necessarily mean that they are great businessmen. As a corollary, that someone’s startup failed doesn’t make them bad businessmen.

Then again, it is part of human nature that we attribute all our successes to skill, and all our failures to bad luck!

 

Marginalised communities and success

Yesterday I was listening to this podcast where Tyler Cowen interviews Neal Stephenson, who is perhaps the only Science Fiction author whose books I’ve read. Cowen talks about the characters in Stephenson’s The Baroque Cycle, a masterful 3000-page work which I polished off in a month in 2014.

The key part of the conversation for me is this:

COWEN: Given your focus on the Puritans and the Baroque Cycle, do you think Christianity was a fundamental driver of the Industrial Revolution and the Scientific Revolution, and that’s why it occurred in northwestern Europe? Or not?

STEPHENSON: One of the things that comes up in the books you’re talking about is the existence of a certain kind of out-communities that were weirdly overrepresented among people who created new economic systems, opened up new trade routes, and so on.

I’m talking about Huguenots, who were the Protestants in France who suffered a lot of oppression. I’m talking about the Puritans in England, who were not part of the established church and so also came in for a lot of oppression. Armenians, Jews, Parsis, various other minority communities that, precisely because of their outsider minority status, were forced to form long-range networks and go about things in an unconventional, innovative way.

So when we think about communities such as Jews or Parsis, and think about their outsized contribution to business or culture, it is this point that Stephenson makes that we should keep in mind. Because Jews and Parsis and Armenians were outsiders, they were “forced to form long-range networks”.

In most cases, for most people from these communities, these long-range networks and unconventional ways of doing things didn’t pay off, and they ended up worse off compared to comparable people from the majority communities wherever they lived.

However, in the few cases where these long-range networks and innovative ways of doing things succeeded, they succeeded spectacularly. And these incidents are cases that we have in mind when we think about the spectacular success or outsized contributions of these communities.

Another way to think of this is – denied “normal life”, people from marginalised communities were forced to take on much more risk in life. The expected value of this risk might have been negative, but this higher risk meant that these communities had a much better “upper tail” than the majority communities that suppressed and oppressed them.

Given that in terms of long-term contributions and impact and public visibility it is only the tails of the distribution that matter (mediocrity doesn’t make news), we think of these communities as having been extraordinary, and wonder if they have “better genes” and so on.
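This “worse mean, better upper tail” effect is easy to simulate. A sketch with entirely invented numbers – a small high-variance group against a large low-variance majority:

```python
import random

# A 10% "outsider" minority with a *lower* average outcome but higher
# variance, versus the majority. Who dominates the very top?
rng = random.Random(0)
majority = [rng.gauss(1.0, 1.0) for _ in range(900_000)]
minority = [rng.gauss(0.9, 2.0) for _ in range(100_000)]  # worse mean, more risk

top_1000 = sorted(majority + minority, reverse=True)[:1000]
cutoff = top_1000[-1]
minority_share = sum(1 for x in minority if x >= cutoff) / len(top_1000)

print(f"minority share of the top 1000: {minority_share:.0%}")
```

Despite being a tenth of the population with a worse average outcome, the high-variance group supplies the bulk of the top thousand – which is exactly the part of the distribution that makes history.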

It’s a simple case of risk, and oppression. This, of course, is no justification for oppressing swathes of people and forcing them to take more risks than necessary. People need to decide on their own risk preferences.

Correlation and causation

So I have this lecture on “smelling (statistical) bullshit” that I’ve delivered in several places, which I inevitably start with a lesson on how correlation doesn’t imply causation. I give a large number of examples of people mistaking correlation for causation, the class makes fun of everything that doesn’t apply to them, then everyone sees this wonderful XKCD cartoon and then we move on.

One of my favourite examples of correlation-causation (which I don’t normally include in my slides) has to do with religion. Praying before an exam in which one did well doesn’t necessarily imply that the prayer resulted in the good performance in the exam, I explain. So far, there has been no outward outrage at my lectures, but this does visibly make people uncomfortable.

Going off on a tangent, the time in life when I realised I’m not religious was when I pondered over the correlation-causation issue some six or seven years back. Until then I’d had this irrational need to draw a relationship between seemingly unrelated things that had happened together once or twice, and that had given me a lot of mental stress. Looking at things from a correlation-causation perspective, however, helped clear up my mind on those things, and also made me believe that most religious activity is pointless. This was a time in life when I got immense mental peace.

Yet, for most of the world, it is not freedom from religion but religion itself that gives them mental peace. People do absurd activities only because they think these activities lead to other good things happening, thanks to a small number of occasions when these things have coincided, either in their own lives or in the lives of their ancestors or gurus.

In one of my lectures a few years back I had remarked that one reason why humans still mistake correlation for causation is religion – for if correlation did not imply causation then most of religious rituals would be rendered meaningless and that would render people’s lives meaningless. Based on what I observed today, however, I think I’ve got this causality wrong.

It’s not because of religion that people mistake correlation for causation. Instead, we’ve evolved to recognise patterns whenever we observe them, and a side effect of that is that we immediately assume causation whenever we see things happening together. Religion is just a special case of application of this correlation-causation second nature to things in real life.

So my daughter (who is two and a half) and I were standing in our balcony this evening, observing that it had rained heavily last night. Heavy rain reminded my daughter of this time when we had visited a particular aunt last week – she clearly remembered watching the heavy rain from this aunt’s window. Perhaps none of our other visits to this aunt’s house really registered in the daughter’s imagination (it’s barely two months since we returned to Bangalore, so admittedly there aren’t that many data points), so this aunt’s house is inextricably linked in her mind to rain.

And this evening, because she wanted it to rain heavily again, the daughter suggested that we go visit this aunt once more. “We’ll go to Inna Ajji’s house and then it will start raining”, she kept saying. “Yes, it rained the last time we went there, but that was random. It wasn’t because we went there”, I kept saying. It wasn’t easy to explain.

You know, when you are about to have a kid you develop visions of how you’ll bring her up, what you’ll teach her, and what she’ll say to “jack” the world. Back then I’d decided that I’d teach my yet-unborn daughter that “correlation does not imply causation”, and that she could use it against “elders” who were telling her absurd stuff.

I hadn’t imagined that mistaking correlation for causation is so fundamental to human nature that it would be a fairly difficult task to actually teach my daughter that correlation does not imply causation! Hopefully in the next one year I can convince her.

What Ails Liverpool

So Liverpool FC has had a mixed season so far. They’re second in the Premier League with 36 points from 14 games (only points dropped being draws against ManCity, Chelsea and Arsenal), but are on the verge of going out of the Champions League, having lost all three away games.

Yesterday’s win over Everton was damn lucky, down to a 96th-minute freak goal scored by Divock Origi (I’d forgotten he’s still at the club). Last weekend’s 3-0 against Watford wasn’t as comfortable as the scoreline suggests, with the scoring opened only midway through the second half. The 2-0 against Fulham before that was similarly a close-fought game.

Of concern to most Liverpool fans has been the form of the starting front three – Mo Salah, Roberto Firmino and Sadio Mane. The trio has missed a host of chances this season, and the team has looked incredibly ineffective in the away losses in the Champions League (the only shot on target in the 2-1 loss against PSG being the penalty that was scored by Milner).

There are positives, of course. The defence has been tightened considerably compared to last season. Liverpool aren’t leaking goals the way they did last season. There have been quite a few clean sheets so far this season. So far there has been no repeat of last season’s situation where they went 4-1 up against ManCity, only to quickly let in two goals and then set up a tense finish.

So my theory is this – each of Liverpool’s front three has an incredibly low strike rate. I don’t know if the xG stat captures this, but the number of chances each of Mane, Salah and Firmino needs before he converts is rather high. If the average striker converts one in two chances, all of these guys convert one in four (these numbers are pulled out of thin air; I haven’t looked at the statistics).

And even during the “glory days” of last season when Liverpool was scoring like crazy, this low strike rate remained. Instead, what helped then was a massive increase in the number of chances created. The one game I watched live (against Spurs at Wembley), what struck me was the number of chances Salah kept missing. But as the chances kept getting created, he ultimately scored one (Liverpool lost 4-1).

What I suspect is that as Klopp decided to tighten things up at the back this season, the number of chances being created has dropped. And with the low strike rate of each of the front three, this lower number of chances translates into a much lower number of goals scored. If we want last season’s scoring rate, we might also have to accept last season’s concession rate (though this season’s goalkeeper is much, much better).
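The arithmetic of the theory is simple: with c chances per game and a conversion rate q, the expected goals are q·c and the probability of a blank is (1−q)^c. The conversion rate and the chance counts below are pulled out of the same thin air as above:

```python
# Expected goals and probability of failing to score, for a forward line
# converting one chance in four (invented numbers, as in the post).

def expected_goals(chances, conversion=0.25):
    return chances * conversion

def p_no_goal(chances, conversion=0.25):
    return (1 - conversion) ** chances

for chances in (8, 4):  # lots of chances (last season) vs fewer (this season)
    print(chances, expected_goals(chances), round(p_no_goal(chances), 2))
```

Halve the chances created and the expected goals halve too, while the probability of drawing a blank roughly triples – which is what a low-strike-rate front line behind a cautious midfield looks like.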

There ain’t no such thing as a free lunch.

Randomness and sample size

I have had a strange relationship with volleyball, as I’ve documented here. Unlike in most other sports I’ve played, I was a rather defensive volleyball player, excelling in backline defence, setting and blocking, rather than spiking.

The one aspect of my game which was out of line with the rest of my volleyball, but in line with my play in most other sports I’ve played competitively, was my serve. I had a big booming serve, which at school level was mostly unreturnable.

The downside of having an unreturnable serve, though, is that you are likely to miss your serve more often than the rest – it might mean hitting it too long, or into the net, or wide. And like in one of the examples I’ve quoted in my earlier post, it might mean not getting a chance to serve at all, as the warm up serve gets returned or goes into the net.

So I was discussing my volleyball non-career with a friend who is now heavily involved in the game, and he thought that I had possibly been extremely unlucky. My own take on this is that given how little I played, it’s quite likely that things would have gone spectacularly wrong.

Changing domains a little bit, there was a time when I was building strategies for algorithmic trading, in a class known as “statistical arbitrage”. The deal there is that you have a small “edge” on each trade, but if you do a large enough number of trades, you will make money. As it happened, the guy I was working for then got spooked out after the first couple of trades went bad and shut down the strategy at a heavy loss.

Changing domains a little less this time, this is also the reason why you shouldn’t check your portfolio too often if you’re investing for the long term – in the short run, when there have been “fewer plays”, the chances of seeing a negative return are higher even if you’re following a mostly safe strategy, as I had illustrated in this blog post in 2008 (using the Livejournal URL, since the table didn’t port well to WordPress).
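The same point can be made with a simulation. Assume a strategy with a small positive daily edge (the drift and volatility below are invented – roughly 10% a year with 16% annualised volatility), and look at the chance that the running total is negative when you check:

```python
import random

# Probability that a small-edge strategy shows a loss after `days` plays.
# mu and sigma are invented: ~10%/year drift, ~16%/year volatility.

def p_down(days, paths=10_000, mu=0.0004, sigma=0.01, seed=1):
    rng = random.Random(seed)
    down = 0
    for _ in range(paths):
        total = sum(rng.gauss(mu, sigma) for _ in range(days))
        if total < 0:
            down += 1
    return down / paths

print(p_down(1), p_down(20), p_down(250))  # daily, monthly, yearly check-ins
```

Check daily and you’ll see a loss nearly half the time; check yearly and the edge has had enough “plays” to show through.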

And changing domains once again, the sheer number of “samples” is possibly one reason why the whole idea of quantification of sport, or “sabermetrics”, first took hold in baseball. The Major League Baseball season is typically 162 games long (and that is before the playoffs), which means that any small edge will translate into results over the course of the league. A smaller league would mean fewer games and thus more randomness, and a higher chance that a “better play” wouldn’t work out.

This also explains why, when “Moneyball” took off with the Oakland A’s around the turn of the millennium, they focussed mainly on league performance and not performance in the playoffs – in the latter, there are simply not enough “samples” for a marginal advantage in team strength to necessarily show up in the results.

And this is the problem with newly appointed managers of elite football clubs in Europe “targeting the Champions League” – a knockout tournament of that format means that the best team need not always win. Targeting a national league, played out over at least 34 games in the season is a much better bet.

Finally, there is also the issue of variance. A higher variance in performance means that a few instances of bad performance are not sufficient to conclude that a player is a bad performer – a great performance need not be too far away. For a player with less randomness in performance – a steadier player, if you will – a few bad performances will tell you that they are unlikely to come good. High-risk high-return players, on the other hand, need to be given a longer rope.

I’d put this in a different way in a blog a few years back, about Mitchell Johnson.

Religion and survivorship bias

Biju Dominic of FinalMile Consulting has a piece in Mint about “what CEOs can learn from religion”. In it, he says,

Despite all the hype, the vast majority of these so-called highly successful, worthy of being emulated companies, do not survive even for few decades. On the other hand, religion, with all its inadequacies, continues to survive after thousands of years.

This is a fallacious comparison.

Firstly, comparing “religion” to a particular company isn’t dimensionally consistent. A better comparison would be to compare at the conceptual level – such as comparing “religion” to “joint stock company”. And like the former, the latter has done rather well for 300 years now, even if specific companies may fold up after a few years.

The other way to make an apples-to-apples comparison is to compare a particular company to a particular religion. And this is where survivorship bias comes in.

Most of the dominant religions of today are hundreds or thousands of years old. In the course of their journey to present-day strength, they first established their own base and then fought off competition from other upstart religions.

In other words, when Dominic talks about “religion” he is only taking into account religions that have displayed memetic fitness over a really long period. What he fails to take into account are the thousands of startup religions that get started every few years and then fade into nothingness.

Historically, such religions haven’t been well documented, but that doesn’t mean they didn’t exist. In contemporary times, one can only look at the thousands of “babas” with cults all around India – each is leading his/her own “startup religion”, and most of them are likely to sink without a trace.

Comparing the best in one class (religions that have survived and thrived over thousands of years) to the average of another class (the average corporation) just doesn’t make sense!

 

Astrology and Data Science

The discussion goes back some six years, to when I’d first started setting up my data and management consultancy practice. Since I’d freshly quit my job to set up the said practice, I had plenty of time on my hands, and the wife suggested that I spend some of that time learning astrology.

Considering that I’ve never been remotely religious or superstitious, I found this suggestion preposterous (I had a funny upbringing in the matter of religion – my mother was insanely religious (including following a certain Baba), and my father was insanely rationalist, and I kept getting pulled in both directions).

Now, the wife has some (indirect) background in astrology. One of her aunts is an astrologer, and specialises in something called “prashNa shaastra”, where the prediction is made based on the time at which the client asks the astrologer a question. My wife believes this has resulted in largely correct predictions (though I suspect a strong dose of confirmation bias there), and (very strangely to me) she seems to believe in the stuff.

“What’s the use of studying astrology if I don’t believe in it one bit”, I asked. “Astrology is very mathematical, and you are very good at mathematics. So you’ll enjoy it a lot”, she countered, sidestepping the question.

We went off into a long discussion on the origins of astrology, and how it resulted in early developments in astronomy (necessary in order to precisely determine the position of planets), and so on. The discussion got involved, and involved many digressions, as discussions of this sort might entail. And as you might expect with such discussions, my wife threw a curveball, “You know, you say you’re building a business based on data analysis. Isn’t data analysis just like astrology?”

I was stumped (ok I know I’m mixing metaphors here), and that had ended the discussion then.

Until I decided to bring it up recently. As it turns out, once again (after a brief hiatus when I decided I’d do a job) I’m in the process of setting up a data and management consulting business. The difference is that this time I’m in London, and that “data science” is a thing (it wasn’t in 2011). And over the last year or so I’ve been kinda disappointed to see what goes on in the name of “data science” around me.

This XKCD cartoon (which I’ve shared here several times) encapsulates it very well. People literally “pour data into a machine learning system” and then “stir the pile” hoping for the results.

Source: https://xkcd.com/1838/

In the process of applying fairly complex “machine learning” algorithms, I’ve seen people not really bother about whether the analysis makes intuitive sense, or if there is “physical meaning” in what the analysis says, or if the correlations actually determine causation. It’s blind application of “run the data through a bunch of scikit learn models and accept the output”.
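In fact, “stirring the pile” is easy to demonstrate: regress pure noise on pure noise, and with enough features the in-sample fit looks impressive while the out-of-sample fit is garbage. A sketch:

```python
import numpy as np

# Fit noise to noise: 50 observations, 40 random features, random target.
rng = np.random.default_rng(0)
n, p = 50, 40

X_train, y_train = rng.normal(size=(n, p)), rng.normal(size=n)
X_test, y_test = rng.normal(size=(n, p)), rng.normal(size=n)

# Ordinary least squares "model" of the noise
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

def r_squared(y, y_hat):
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

print("in-sample R^2:    ", r_squared(y_train, X_train @ beta))  # looks great
print("out-of-sample R^2:", r_squared(y_test, X_test @ beta))    # junk
```

If you only ever look at the in-sample numbers (or “accept the output”), the pile of noise looks like a model.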

And this is exactly how astrology works. There are a bunch of predictor variables (the positions of different “planets” in various parts of the “sky”). There is the observed variable (whether some disaster happened or not, basically), which is conveniently in binary format. And then some of our ancients did some data analysis on this, trying to identify combinations of predictors that predicted the output (unfortunately they didn’t have the power of statistics or computers, so in that sense their models were limited). And then they simply accepted the outputs, without questioning why the position of Jupiter at the time of your wedding should affect how your marriage will go.

So I brought up the topic of astrology and data science again recently, saying “OK after careful analysis I admit that astrology is the oldest form of data science”. “That’s not what I said”, the wife countered. “I said that data science is new age astrology, and not the other way round”.

It’s hard to argue with that!