Code Density

As many regular readers of this blog know, I use R for most of my data work. There have been a few occasions when I’ve tried to use Python, but I’ve found that I’m far less efficient in it than I am in R, and so I abandoned it, despite the relative ease of putting Python code into production.

Now in my company, as in most companies, people use both Python and R (the team that reports to me largely uses R; everyone else largely uses Python). Till recently I used to claim that I’m multilingual, in the sense that I can read Python code fairly competently. Of late I’m not sure I am – I find it increasingly difficult to parse and read production-grade Python code.

And now, after some experiments with ChatGPT, and after exploring other people’s code, I have an idea of why I’m finding it hard to read production-grade Python code. It has to do with “code density”.

Of late I’ve been experimenting with Spark (finally – in this job I do a lot of “big data” work, something I never had to do in my consulting career before this). As part of this, I was reading someone’s PySpark code.

And then it hit me – the problem (rather, my problem) with Python is that it is far more verbose than R. The number of characters, or lines of code, required to do something in Python is far greater than what you need in R (especially if you use the tidyverse family of packages, as I do, including sparklyr for Spark).
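To make this concrete, here is a minimal sketch of the kind of thing I mean – a toy aggregation in dplyr, with entirely made-up data and column names (in sparklyr, the same verbs run unchanged against a Spark DataFrame):

    library(dplyr)

    # Hypothetical trips data, purely for illustration
    trips <- data.frame(
      city   = c("Bangalore", "Bangalore", "Mumbai", "Mumbai"),
      status = c("completed", "cancelled", "completed", "completed"),
      fare   = c(250, NA, 400, 380)
    )

    # Mean fare per city for completed trips, as one dense pipeline
    trip_summary <- trips %>%
      filter(status == "completed") %>%
      group_by(city) %>%
      summarise(mean_fare = mean(fare, na.rm = TRUE)) %>%
      arrange(desc(mean_fare))

The PySpark version of just that pipeline typically needs an explicit import of pyspark.sql.functions, plus wrappers like col(), agg() and alias() around each step – none of it difficult, but the same logic ends up spread over noticeably more lines.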

Why does the density of code matter? It has to do with aesthetics, modularity, and ease of understanding.

Yesterday I was writing some code that I plan to put into production. After a few hours of coding, I looked at it and felt disgusted with myself – it was one far-too-long monolithic block of code. It might have made sense while I was writing it, but I knew that if I were to revisit it in a week or two, I wouldn’t be able to understand what the hell was happening there.

I’ve never worked as a professional software engineer, but with the amount of coding I’ve done, I’ve worked out what a “reasonable length for a code block” is. It’s like that apocryphal story of Indian public examiners who evaluate high school history answers based on how long they are – “if they were to place an ordinary Reynolds 045 pen vertically on the sheet, the answer should be longer than the pen for the student to get five marks”.

[Image of a Reynolds 045 pen. Caption: an answer in a high school history exam needs to be longer than this; a code block or function should be shorter than this.]

It’s the reverse here. Approximately speaking, if you were to place a Reynolds pen vertically on your screen (at your normal font size), no function (or code block) should be longer than the pen.

That is roughly how much the eye can take in on one normal MacBook screen (I use a massive external monitor at work, and a less massive, but equally wide, one at home). If you have to keep scrolling up and down to follow the full logic, there is a higher chance you will make mistakes, and it becomes much harder for someone else to understand the code.

Till recently (as in earlier this week) I would crib like crazy that people coding in Python made their code “too modular” – that I had to keep switching back and forth between different functions (and classes!!) to make sense of some logic, and that this made the code hard to debug (I still think there is a limit to how modular you can make your ETL code).

Now, however (I’m writing this on a Saturday – I’m not working today), from the code density perspective, it all makes sense to me.

The advantage of R is that because the code is far denser, you can pack a far greater amount of logic into a Reynolds-pen length of code. So over the years I’ve got used to having that much logic presented to me in one chunk (without having to scroll or switch functions).

The relatively lower density of Python, however, means that the same amount of logic that would be one function in R is now split across several functions. It is not that the writer of the code is “making it too modular” or “writing functions just for the heck of it”. It is just that their “mental Reynolds pen” doesn’t allow them to pack more lines into a chunk or function, and Python’s lower density means there is only so much logic that can go in there.

As part of my undergrad, I did a course on Software Engineering (the one thing the professor told us then was that we should NOT take up software engineering as a career – “it’s a boring job”, he said). One of the things we learnt there was that in conventional software services contexts, billing would happen as a (nonlinear) function of “kilo lines of code” (this was in early 2003).

Now, looking back, one thing I can say is that the rate per kilo line of R code ought to be much higher than the rate per kilo line of Python code.

Cross-posted on my now-largely-dormant Art of Data Science newsletter

Nested Ternary Operators

It’s nearly twenty years since I first learnt to code, and the choice of language then (imposed by my school, among others) was C. One of the most fascinating things about C was what was simply called the “ternary operator”, which is kind of similar to the IF function in Excel, the ifelse function in R, and np.where in Python.

Basically, the ternary operator consists of a ‘?’ and a ‘:’. It is an expression of the form “if this, then that, else something else”. So, for example, if you had two variables a and b and had to return the maximum of the two, you could use the ternary operator and write a > b ? a : b.
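Since R is the language I mostly write in these days, here is a sketch of the same logic there (toy values, purely for illustration) – R’s if is itself an expression, and ifelse() is the vectorised cousin mentioned above:

    a <- 5; b <- 3

    # The direct analogue of C's  max = a > b ? a : b;
    # (R's if is an expression, so it returns a value)
    max_ab <- if (a > b) a else b   # 5

    # The vectorised cousin, ifelse(): elementwise maximum of two vectors
    x <- c(1, 9); y <- c(4, 2)
    ifelse(x > y, x, y)             # c(4, 9)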

Soon I was attending programming contests, where there would be questions on debugging programs. These would inevitably contain one question on ternary operators. A few years later I started attending job interviews for software engineering positions. The ternary operator questions were still around, except that now it would be common to “nest” ternary operators (include one inside the other). It became a running joke that the only place you’d see nested ternary operators was in software engineering interviews.

The thing with the ternary operator is that while it lets you write your program in fewer lines of code and makes it seem more concise, it makes the code a lot less readable. That in turn makes it hard for people to understand your code, and thus hard to debug. In that sense, using the ternary operator while coding in C is not considered particularly good practice.
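The same tradeoff shows up in R. A nested ifelse() is R’s version of the nested ternary, and flattening it out – say with dplyr’s case_when(), in this made-up labelling example – makes each branch legible at a glance:

    library(dplyr)

    x <- c(-2, 0, 3)

    # Nested, ternary-style: compact, but the branches are hard to scan
    labels <- ifelse(x > 0, "positive", ifelse(x < 0, "negative", "zero"))

    # Flattened with case_when(): one condition per line, read top to bottom
    labels <- case_when(
      x > 0 ~ "positive",
      x < 0 ~ "negative",
      TRUE  ~ "zero"
    )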

It’s nearly 2018 now, and C is not used that much these days, so the ternary operator, and the nested ternary operator, have made their exit – even from programming interviews, if I’m not wrong. However, people still maintain the practice of writing highly optimised code.

Now, every programmer who thinks he’s a good programmer likes to write efficient code. There’s a sense of elegance about code written concisely, in only a few lines. Sometimes such elegant code is also more efficient, speeding up computation and consuming less memory (think, for example, of vectorised operations in R).
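As a toy illustration of what I mean by vectorised operations (made-up data; the exact speedup will vary):

    x <- runif(1e6)

    # Loop version: explicit, but slow in R for large vectors
    squares <- numeric(length(x))
    for (i in seq_along(x)) {
      squares[i] <- x[i]^2
    }

    # Vectorised version: one line, and typically far faster
    squares <- x^2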

The problem, however, is that such elegance comes at a cost to readability. The more optimised a piece of code is, the harder it is for someone else to understand, and thus the harder it is to debug. And the more complicated the algorithm being coded, the worse this gets.

It makes me think that the reason all those ternary operators used to appear in those software engineering interviews (FYI I’ve never done a software engineering job) was to check whether you’re able to read complicated code that others write!

Data Science and Software Engineering

I’m a data scientist. I’m good with numbers, and with handling large and medium-sized data sets (that doesn’t mean I’m bad at handling small data sets, of course). The work-related thing that gives me the most kicks is to take a bunch of data and, through a process of simple analysis, extract information out of it. To twist and turn the data – or, to use management jargon, “slice and dice” it – and see things that aren’t visible to too many people. To formulate hypotheses, and use data to prove or disprove them. To represent data in simple but intuitive formats (i.e. graphs) so as to convey the information I want to convey.

I can count my last three jobs (including my current one) as results of my quest to become better at data science and modelling. Unfortunately, none of them has turned out particularly well. The problem has been that in all these jobs, data science has been tightly coupled with software engineering, and I suck at software engineering.

Let me stop for a moment and tell you that I don’t mind programming. In fact, I love programming. I love writing code that makes my job easier, automates things, and gives me data in the formats I desire. But I hate software engineering. Writing code within a particular system or framework. Adhering to standards that someone else sets for “good code”. Following processes, and making my code usable by some dumbfuck somewhere else who wouldn’t get it if I wrote it the way I wanted. As I’ve mentioned earlier, I like coding for myself. I don’t like coding for someone else. And so I suck at software engineering.

Now I wonder if it’s possible at all to decouple data science from software engineering. My instinct tells me it should be – that I need not write production-level code in order to turn my data-based insights into something commercially viable. Unfortunately, in my search around the corporatosphere so far, I haven’t been able to find anything of the sort.

Which makes me wonder if I should create my own niche, rather than hoping for someone else to create it for me.