The Science in Data Science

It’s a decade since the phrase “data scientist” got coined, though if you go on LinkedIn, you will find people who claim to have more than two decades of experience in the subject.

The origins of the phrase itself are unclear, though some sources claim that it came out of this 2012 HBR article written by Thomas Davenport and DJ Patil (though, in 2009, Hal Varian, formerly Google’s Chief Economist, had said that the “sexiest job of the 21st century” will be that of a statistician).

Some of you might recall that in 2018, I had said that “I’m not a data scientist any more”. That was mostly down to my experience working with companies in London, where I found that data science was used as a euphemism for “machine learning” – something I was incredibly uncomfortable with.

With the benefit of hindsight, it seems like I was wrong. My view of data science as a euphemism for machine learning came from interacting with a small sample of people (though it could be an English quirk). As I’ve dug around over the years, it seems like the “science” in data science comes not from the maths in machine learning, but from elsewhere.

One phenomenon that had always intrigued me was the number of people with PhDs, especially NOT in maths, computer science or statistics, who have made a career in data science. Initially I put it down to “the gap between PhD and tenure track faculty positions in science”. However, the numbers kept growing.

The more perceptive of you might know that I run a podcast now. It is called “Data Chatter”, and is ten episodes old now. The basic aim of the podcast is for me to have some interesting conversations – and then release them for public benefit. Yeah, yeah.

So, there was this thing that intrigued me, and I have a podcast. I did what you would have expected me to do – get on a guest who went from a science background to data science. I got Dhanya, my classmate from school, to talk about how her background with a PhD in neuroscience has helped her become a better data scientist.

It is a fascinating conversation, and served its primary purpose of making me understand what the “science” in data science really is. I had gone into the conversation expecting to talk about some machine learning, and how that gets used in academia or whatever. Instead, we spoke for an hour about designing experiments, collecting data and testing hypotheses.

The science in “data science” basically represents the “scientific method”. What Dhanya told me (you should listen to the conversation) is that a PhD prepares you for thinking in the scientific method, and drills into you years of practice in it. And this is especially true of “experimental” PhDs.
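For those who think better in code, here is a minimal sketch of what “testing a hypothesis” can look like once the experiment is designed and the data is collected. This is my own illustration and not something from the conversation: the function name `permutation_test` and the numbers are made up, and a permutation test is just one of several reasonable ways to do this.

```python
import random
from statistics import mean


def permutation_test(control, treatment, n_permutations=5000, seed=42):
    """Hypothetical helper: one-sided permutation test.

    Null hypothesis: the group labels don't matter.
    Alternative: the treatment group has a higher mean than the control group.
    Returns (observed difference in means, estimated p-value).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = mean(treatment) - mean(control)
    pooled = list(control) + list(treatment)
    n_control = len(control)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Reassign the labels at random and recompute the difference
        rng.shuffle(pooled)
        diff = mean(pooled[n_control:]) - mean(pooled[:n_control])
        if diff >= observed:
            at_least_as_extreme += 1

    return observed, at_least_as_extreme / n_permutations


# Invented measurements, purely for illustration
control = [12.1, 11.8, 12.4, 11.9, 12.0, 12.3]
treatment = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0]

observed, p_value = permutation_test(control, treatment)
print(f"observed difference: {observed:.2f}, p-value: {p_value:.4f}")
```

The idea is that if the treatment made no difference, shuffling the labels should produce a gap as large as the observed one fairly often. A small p-value says it rarely does, which counts as evidence against the “no difference” hypothesis.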

And then, last night, while preparing the notes for the podcast release, I stumbled upon the original HBR article by Thomas Davenport and DJ Patil talking about “data science”. And I found that they talk about the scientific method as well. And I found that I had talked about it in my newsletter as well – only to forget it later. This is what I had written:

Reading Patil and Davenport’s article carefully suggests, however, that companies might be making a deliberate attempt at recruiting pure science PhDs for data scientist roles.

The following excerpts from the article (which possibly shaped the way many organisations think about data science) can help us understand why PhDs are sought after as data scientists.

  • Data scientists’ most basic, universal skill is the ability to write code. This may be less true in five years’ time (Ed: the article was published in late 2012, so we’re almost “five years later” now)
  • Perhaps it’s becoming clear why the word “scientist” fits this emerging role. Experimental physicists, for example, also have to design equipment, gather data, conduct multiple experiments, and communicate their results.
  • Some of the best and brightest data scientists are PhDs in esoteric fields like ecology and systems biology.
  • It’s important to keep that image of the scientist in mind—because the word “data” might easily send a search for talent down the wrong path

Patil and Davenport make it very clear that traditional “data analysts” may not make for great data scientists.

We learn, and we forget, and we re-learn. But learning is precisely what the scientific method, which underpins the “science” in data science, is all about. And it is definitely NOT about machine learning.

Podcast: All Reals

I had spoken here a few times about starting a new “data” podcast, right? The first episode is out today, and in this I speak to S Anand, cofounder and CEO of Gramener, about the interface of business with data science.

It’s a long freewheeling conversation, where we talk about data science in general, about Excel, about data visualisations, pie charts, Tufte and all that.

Do listen – it should be available on all podcast platforms, and let me know what you think. Oh, and don’t forget to subscribe to the podcast. New episodes will be out every Tuesday morning.

And if you think you want to be on the podcast, or know someone who wants to be a guest on the podcast, you can reach out. datachatterpodcast AT gmail.

Launching: Data Chatter

A few weeks back I had mentioned here that I’m starting a podcast. And it is now ready for release. Listen to the trailer here:

It is a series of conversations about all things data. First episode will be out on Tuesday, and then weekly after that. I’ve already built up an inventory of seven episodes. So far I’ve recorded episodes about big data, business intelligence, visualisations, a lot of “domain-specific” analytics, and the history of analytics in India. And many more are to come.

Subscribe to the podcast to be able to listen to it whenever it comes out. It is available on all podcasting platforms. For some reason, Apple is not listed on the anchor site, but if you search for “Data Chatter” on Apple Podcasts, you should find it (I did).

And of course, feedback is welcome (you can just comment on this post). And please share this podcast with whoever else you think might like it.

More on Diversity and Inclusion

Diversity and Inclusion are words that are normally thrown around by people of a certain persuasion. In fact, they were among the key principles espoused by one of my earlier employers as well (to their credit, some of their diversity and inclusion sessions did a lot to help broaden my worldview).

However, as I’ve argued earlier on this blog, in a lot of cases, arguments on diversity and inclusion are (literally) only skin deep – people go big on diversity of sex, sexuality, skin colour, nationality and so on while giving short shrift to things like diversity of thought, which in my opinion plays a larger role in building a more successful team.

I’ve also written earlier on this blog about how some simple acts of inclusion can go a long way – for example, I’d mentioned how building a pedestrian walkway, or pedestrian crossing with signals, would help make one of the roads in Bangalore more inclusive towards pedestrians (a class of people the usual proponents of diversity and inclusion don’t care about).

I was reminded of diversity and inclusion when the recent hoopla about messaging apps happened. A number of my contacts said they were leaving WhatsApp and moving to Telegram or Signal. Others said they weren’t going anywhere and were sticking to WhatsApp, and that Facebook’s new privacy rules were nothing new.

From my personal point of view, since I didn’t have a view on this messaging apps issue, the best solution turned out to be “inclusion”.

I’m on all apps. I’m on Signal, and Telegram, and WhatsApp, and iMessage, and good old SMS. However you choose to reach me, I’m there to receive your message and respond to you. In that sense, when you don’t have a strong opinion, the best thing to do is to be inclusive.

Of late I’ve realised it’s the same with language. Since I now work for a company that is headquartered in Gurgaon, a number of colleagues instinctively speak in Hindi. Initially I used to be a bit snobbish, and tell them that my Hindi sucked, and when they spoke Hindi, I would reply in English.

Over time, however, I’ve realised that I’m only being an asshole by refusing to be inclusive. Since I know Hindi (I got more marks in Hindi in Class 10 board exams than I did in English – not that that says anything), I should let the people decide whether I’m worth talking to in Hindi at all. I’ll talk to them in my broken Hindi, and if they think it’s too broken they can choose to switch to a language I’m more comfortable in.

And a week ago, Pranay and Saurabh of the Puliyabaazi podcast asked me if I was willing to go on their (Hindi) podcast to talk about logical fallacies and “how not to use data”. I immediately accepted, not only because it’s a great podcast to be on (they’re fun to talk to), but also because it gives me an opportunity to show off my broken Hindi.

The episode dropped on Thursday. You can listen to it here:

I realised while I was recording that my Hindi has become really rusty, and I found myself struggling for words many times. I also realised after the episode dropped that I don’t even understand what the title means, yet I’ve been happily sharing it around in my office! (A colleague kept asking me if I knew this word and that word, and I realised the answer to all of that was no. Yet I had made assumptions and gone on with the podcast – another example of my own “inclusiveness”!)

Henceforth I’m never telling a colleague that I don’t know Hindi. However, if I find that someone overestimates my level of Hindi I might inflict this podcast on them. Even then, if they choose to speak to me in Hindi, so be it! I’m going to make an attempt to be more inclusive, after all.

 

Good vodka and bad chicken

When I studied Artificial Intelligence, back in 2002, neural networks weren’t a thing. The limited compute capacity and storage available at that point in time meant that most artificial intelligence consisted of what is called “rule based methods”.

And as part of the course we learnt about machine translation, and the difficulty of getting the implicit meaning across. The favourite example among computer scientists at that time was the story of how some scientists translated “the spirit is willing but the flesh is weak” into Russian using English-Russian translation software, and then converted it back into English using Russian-English translation software.

The result was “the vodka is excellent but the chicken is not good”.

While this joke may not be valid any more thanks to the advances in machine translation, aided by big data and neural networks, the issue of translation is useful in other contexts.

Firstly, speaking in a language that is not your “technical first language” makes you eschew jargon. If you have been struggling to get rid of jargon from your professional vocabulary, one way to get around it is to speak more in your native language (which, if you’re Indian, is unlikely to be your technical first language). Devoid of the idioms and acronyms that you normally fill your official conversation with, you are forced to think, and this practice of talking technical stuff in a non-usual language will help you cut your jargon.

There is another use case for using non-standard languages – dealing with extremely verbose prose. A number of commentators, a large number of whom are rather well-reputed, have this habit of filling their columns with flowery language, GRE words, repetition and rhetoric. While there is usually some useful content in these columns, it gets lost in the language and idioms and other things that would make the columnist’s high school English teacher happy.

I suggest that these columns be given the spirit-flesh treatment. Translate them into a non-English language, get rid of redundancies in sentences and then translate them back into English. This process, if the translators are good at producing simple language, will remove the bluster and make the column much more readable.

Speaking in a non-standard language can also make you get out of your comfort zone and think harder. Earlier this week, I spent two hours recording a podcast in Hindi on cricket analytics. My Hindi is so bad that I usually think in Kannada or English and then translate the sentence “live” in my head. And as you can hear, I sometimes struggle for words. Anyway here is the thing. Listen to this if you can bear to hear my Hindi for over an hour.