Algorithmic curation

When I got my first smartphone (a Samsung Galaxy Note 2) in 2013, one of the first apps I installed on it was Flipboard. I’d seen the app while checking out some phones at either the Apple or Samsung retail outlets close to my home, and it seemed like a rather interesting idea.

For a long time, Flipboard was my go-to app to check the day’s news, as it conveniently categorised news into “tech”, “business” and “sport” and learnt about my preferences and fed me stuff I wanted. And then after some update, it suddenly stopped working – somehow it started serving too much stuff I didn’t want to read about, and when I tuned (by “following” and “unfollowing” topics) my feed, it progressively got worse.

I stopped using it some 2 years back, but out of curiosity started using it again recently. While it did throw up some nice articles, there is too much unwanted stuff in the app. More precisely, there’s a lot of “clickbaity” stuff (“10 things about Narendra Modi you would never want to know” and the like) in my feed, meaning I have to wade through a lot of such articles to find the occasional good ones.

(Aside: I dedicate about half a chapter to this phenomenon in my book. The technical term is “congestion”. I talk about it in the context of markets in relationships and real estate)

Flipboard is not the only one. I use this app called Pocket to bookmark long articles and read later. A couple of years back, Pocket started giving “recommendations” based on what I’d read and liked. Initially it was good, and mostly curated from what my “friends” on Pocket recommended. Now, increasingly I’m getting clickbaity stuff again.

I stopped using Facebook a long time before they recently redesigned their newsfeed (to give more weight to friends’ stuff than third party news), but I suspect that one of the reasons they made the change was the same – the feed was getting overwhelmed with clickbaity stuff, which people liked but didn’t really read.

Basically, there seems to be a widespread problem in a lot of automatically curated news feeds. To put it another way, the clickbaity websites seem to have done too well in terms of gaming whatever algorithms the likes of Facebook, Flipboard and Pocket use to build their automated recommendations.

And more worryingly, with all these curators starting to do badly around the same time (ok this is my empirical observation. Given few data points I might be wrong), it suggests that all automated curation algorithms use a very similar algorithm! And that can’t be a good thing.

Stirring the pile efficiently

Warning: This is a technical post, and involves some code, etc. 

As I’ve ranted a fair bit on this blog over the last year, a lot of “machine learning” in the industry can be described as “stirring the pile”. Regular readers of this blog will be familiar with this image from XKCD by now:

Source: https://xkcd.com/1838/

Basically people simply take datasets and apply all the machine learning techniques they have heard of (implementation is damn easy – scikit learn allows you to implement just about any model in three similar looking lines of code; See my code here to see how similar the implementation is).

So I thought I’ll help these pile-stirrers by giving some hints of what method to use for different kinds of data. I’ve over-simplified stuff, and so assume that:

  1. There are two predictor variables X and Y. The predicted variable “Z” is binary.
  2. X and Y are each drawn from a standard normal distribution.
  3. The predicted variable Z is “clean” – there is a region in the X-Y plane where Z is always “true” and another region where Z is always “false”
  4. So the idea is to see which machine learning techniques are good at identifying which kind of geometrical figures.
  5. Everything is done “in-sample”. Given the nature of the data, it doesn’t matter if we do it in-sample or out-of-sample.

For those that understand Python (and every pile-stirrer worth his salt is excellent at Python), I’ve put my code in a nice Jupyter Notebook, which can be found here.

So this is what the output looks like. The top row shows the “true values” of Z. Then we have a row for each of the techniques we’ve used, which shows how well these techniques can identify the pattern given in the top row (click on the image for full size).

As you can see, I’ve chosen some common geometrical shapes and seen which methods are good at identifying those. A few pertinent observations:

  1. Logistic regression and linear SVM are broadly similar, and both are shit for this kind of dataset. Being linear models, they fail to deal with non-linear patterns
  2. SVM with RBF kernel is better, but it fails when there are multiple “true regions” in the dataset. At least it’s good at figuring out some non-linear patterns. However, it can’t figure out the triangle or square – it draws curves around them, instead.
  3. Naive Bayesian (I’ve never understood this even though I’m pretty good at Bayesian statistics, but I understand this is a commonly used technique; and I’ve used default parameters so not sure how it is “Bayesian” even) can identify some stuff but does badly when there are disjoint regions where Z is true.
  4. Ensemble methods such as Random Forests and Gradient Boosting do rather well on all the given inputs. They do well for both polygons and curves. Elsewhere, Ada Boost mostly does well but trips up on the hyperbola.
  5. For some reason, Lasso fails to give an output (in the true spirit of pile-stirring, I didn’t explore why). Ridge is again a regression method and so does badly on this non-linear dataset
  6. Neural Networks (Multi Layer Perceptron to be precise) does reasonably well, but can’t figure out the sharp edges of the polygons.
  7. Decision trees again do rather well. I’m pleasantly surprised that they pick up and classify the disjoint sets (multi-circle and hyperbola) correctly. Maybe it’s the way scikit learn implements them?

Of course, the datasets that one comes across in real life are never such simple geometrical figures, but I hope that this set can give you some idea on what techniques to use where.

At least I hope that this makes you think about the suitability of different techniques for the data rather than simply applying all the techniques you know and then picking the one that performs best on your given training and test data.

That would count as nothing different from p-hacking, and there’s an XKCD for that as well!

Source: https://xkcd.com/882/

Machine learning and degrees of freedom

For starters, machine learning is not magic. It might appear like magic when you see Google Photos automatically tagging all your family members correctly, down to the day of their birth. It might appear so when Siri or Alexa give a perfect response to your request. And the way AlphaZero plays chess is almost human!

But no, machine learning is not magic. I’d made a detailed argument about that in the second edition of my newsletter (subscribe if you haven’t already!).

One way to think of it is that the output of a machine learning model (which could be anything from “does this picture contain a cat?” to “is the speaker speaking in English?”) is the result of a mathematical formula, whose parameters are unknown at the beginning of the exercise.

As the system gets “trained” (of late I’ve avoided using the word “training” in the context of machine learning, preferring to use “calibration” instead. But anyway…), the hitherto unknown parameters of the formula get adjusted in a manner that the formula output matches the given data. Once the system has “seen” enough data, we have a model, which can then be applied on unknown data (I’m completely simplifying it here).

The genius in machine learning comes in setting up mathematical formulae in a way that given input-output pairs of data can be used to adjust the parameters of the formulae. The genius in deep learning, which has been the rage this decade, for example, comes from a 30-year old mathematical breakthrough called “back propagation”. The reason it took until a few years back for it to become a “thing” has to do with data availability, and compute power (check this terrific piece in the MIT Tech Review about deep learning).

Within machine learning, the degree of complexity of a model can vary significantly. In an ordinary univariate least squares regression, for example, there are only two parameters the system can play with (slope and intercept of the regression line). Even a simple “shallow” neural network, on the other hand, has thousands of parameters.

Because a regression has so few parameters, the kind of patterns that the system can detect is rather limited (whatever you do, the system can only draw a line. Nothing more!). Thus, regression is applied only when you know that the relationship that exists is simple (and linear), or when you are trying to force-fit a linear model.

The upside of simple models such as regression is that because there are so few parameters to be adjusted, you need relatively few data points in order to adjust them to the required degree of accuracy.

As models get more and more complicated, the number of parameters increases, thus increasing the complexity of patterns that can be detected by the system. Close to one extreme, you have systems that see lots of current pictures of you and then identify you in your baby pictures.

Such complicated patterns can be identified because the system parameters have lots of degrees of freedom. The downside, of course, is that because the parameters start off having so much freedom, it takes that much more data to “tie them down”. The reason Google Photos can tag you in your baby pictures is partly down to the quantum of image data that Google has, which does an effective job of tying down the parameters. Google Translate similarly uses large repositories of multi-lingual text in order to “learn languages”.

Like most other things in life, machine learning also involves a tradeoff. It is possible for systems to identify complex patterns, but for that you need to start off with lots of “degrees of freedom”, and then use lots of data to tie down the variables. If your data is small, then you can only afford a small number of parameters, and that limits the complexity of patterns that can be detected.

One way around this, of course, is to use your own human intelligence as a pre-processing step in order to set up parameters in a way that they can be effectively tuned by data. Gopi had a nice post recently on “neat learning versus deep learning“, which is relevant in this context.

Finally, there is the issue of spurious correlations. Because machine learning systems are basically mathematical formulae designed to learn patterns from data, spurious correlations in the input dataset can lead to the system learning random things, which can hamper its predictive power.

Data sets, especially ones that have lots of dimensions, can display correlations that appear at random, but if the input dataset shows enough of these correlations, the system will “learn” them as a pattern, and try to use them in predictions. And the more complicated your model gets, the harder it is to know what it is doing, and thus the harder it is to identify these spurious correlations!

And the thing with having too many “free parameters” (lots of degrees of freedom but without enough data to tie down the parameters) is that these free parameters are especially susceptible to learning the spurious correlations – for they have no other job.

Thinking about it, after all, machine learning systems are not human!

Nested Ternary Operators

It’s nearly twenty years since I first learnt to code, and the choice of language (imposed by school, among others) then was C. One of the most fascinating things about C was what was simply called the “ternary operator”, which is kinda similar to the IF statement in Excel, ifelse statement in R and np.where statement in Python.

Basically the ternary operator consisted of a ‘?’ and a ‘:’. It was a statement that took the form of “if this then that else something else”. So, for example, if you had two variables a and b, and had to return the maximum of them, you could use the ternary operator to say a>b?a:b.

Soon I was attending programming contests, where there would be questions on debugging programs. These would inevitably contain one question on ternary operators. A few years later I started attending job interviews for software engineering positions. The ternary operator questions were still around, except that now it would be common to “nest” ternary operators (include one inside the other). It became a running joke that the only place you’d see nested ternary operators was in software engineering interviews.

The thing with the ternary operator is that while it allows you to write your program in fewer lines of code and make it seem more concise, it makes the code a lot less readable. This in turn makes it hard for people to understand your code, and thus makes it hard to debug. In that sense, using the operator while coding in C is not considered particularly good practice.

It’s nearly 2018 now, and C is not used that much nowadays, so the ternary operator, and the nested ternary operator, have made their exit – even from programming interviews if I’m not wrong. However, people still continue to maintain this practice of writing highly optimised code.

Now, every programmer who thinks he’s a good programmer likes to write efficient code. There’s this sense of elegance about code written in a rather elegant manner, using only a few lines. Sometimes such elegant code is also more efficient, speeding up computation and consuming less memory (think, for example, vectorised operations in R).

The problem, however, is that such elegance comes with a tradeoff with readability. The more optimised a piece of code is, the harder it is for someone else to understand it, and thus the harder it is to debug. And the more complicated the algorithm being coded, the worse it gets.

It makes me think that the reason all those ternary operators used to appear in those software engineering interviews (FYI I’ve never done a software engineering job) is to check if you’re able to read complicated code that others write!

AlphaZero defeats Stockfish: Quick thoughts

The big news of the day, as far as I’m concerned, is the victory of Google Deepmind’s AlphaZero over Stockfish, currently the highest rated chess engine. This comes barely months after Deepmind’s AlphaGo Zero had bested the earlier avatar of AlphaGo in the game of Go.

Like its Go version, the AlphaZero chess playing machine learnt using reinforcement learning (I remember doing a term paper on the concept back in 2003 but have mostly forgotten). Basically it wasn’t given any “training data”, but the machine trained itself on continuously playing with itself, with feedback given in each stage of learning helping it learn better.

After only about four hours of “training” (basically playing against itself and discovering moves), AlphaZero managed to record this victory in a 100-game match, winning 28 and losing none (the rest of the games were drawn).

There’s a sample game here on the Chess.com website and while this might be a biased sample (it’s likely that the AlphaZero engineers included the most spectacular games in their paper, from which this is taken), the way AlphaZero plays is vastly different from the way engines such as Stockfish have been playing.

I’m not that much of a chess expert (I “retired” from my playing career back in 1994), but the striking things for me from this game were

  • the move 7. d5 against the Queen’s Indian
  • The piece sacrifice a few moves later that was hard to see
  • AlphaZero’s consistent attempts until late in the game to avoid trading queens
  • The move Qh1 somewhere in the middle of the game

In a way (and being consistent with some of the themes of this blog), AlphaZero can be described as a “stud” chess machine, having taught itself to play based on feedback from games it’s already played (the way reinforcement learning broadly works is that actions that led to “good rewards” are incentivised in the next iteration, while those that led to “poor rewards” are penalised. The challenge in this case is to set up chess in a way that is conducive for a reinforcement learning system).

Engines such as StockFish, on the other hand, are absolute “fighters”. They get their “power” by brute force, by going down nearly all possible paths in the game several moves down. This is supplemented by analysis of millions of existing games of various levels which the engine “learns” from – among other things, it learns how to prune and prioritise the paths it searches on. StockFish is also fed a database of chess openings which it remembers and tries to play.

What is interesting is that AlphaZero has “discovered” some popular chess openings through the course of is self-learning. It is interesting to note that some popular openings such as the King’s Indian or French find little favour with this engine, while others such as the Queen’s Gambit or the Queen’s Indian find favour. This is a very interesting development in terms of opening theory itself.

Frequency of openings over time employed by AlphaZero in its “learning” phase. Image sourced from AlphaZero research paper.

In any case, my immediate concern from this development is how it will affect human chess. Over the last decade or two, engines such as stockfish have played a profound role in the development of chess, with current top players such as Magnus Carlsen or Sergey Karjakin having trained extensively with these engines.

The way top grandmasters play has seen a steady change in these years as they have ingested the ideas from engines such as StockFish. The game has become far more quiet and positional, as players seek to gain small advantages which steadily improves over the course of (long) games. This is consistent with the way the engines that players learn from play.

Based on the evidence of the one game I’ve seen of AlphaZero, it plays very differently from the existing engines. Based on this, it will be interesting to see how human players who train with AlphaZero based engines (or their clones) will change their game.

Maybe chess will turn back to being a bit more tactical than it’s been in the last decade? It’s hard to say right now!

Algorithms and the Turing Test

One massive concern about the rise of artificial intelligence and machine learning is the perpetuation of human biases. This could be racism (the story, possibly apocryphal, of a black person being tagged as a gorilla) or sexism (see tweet below) or any other forms of discrimination (objective looking data that actually represents certain divisions).

In other words, mainstream concern about artificial intelligence is that it is too human, and such systems should somehow be “cured” of their human biases in order to be fair.

My concern, though, is the opposite. That many of the artificial intelligence and machine learning systems are not “human enough”. In other words, that most present day artificial intelligence and machine learning systems would not pass the Turing Test.

To remind you of the test, here is an extract from Wikipedia:

The Turing test, developed by Alan Turing in 1950, is a test of a machine’s ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversationsbetween a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation is a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel such as a computer keyboard and screen so the result would not depend on the machine’s ability to render words as speech.[2] If the evaluator cannot reliably tell the machine from the human, the machine is said to have passed the test. The test does not check the ability to give correct answers to questions, only how closely answers resemble those a human would give.

The test was introduced by Turing in his paper, “Computing Machinery and Intelligence“, while working at the University of Manchester (Turing, 1950; p. 460).

Think of any recommender system, for example. With some effort, it is easy for a reasonably intelligent human to realise that the recommendations are being made by a machine. Even the most carefully designed recommender systems give away the fact that their intelligence is artificial once in a while.

To take a familiar example, people talk about the joy of discovering books in bookshops, and about the quality of recommendations given by an expert bookseller who gets his customers. Now, Amazon perhaps collects more data about its customers than any such bookseller, and uses them to recommend books. However, even a little scrolling reveals that the recommendations are rather mechanical and predictable.

It’s similar with my recommendations on Netflix – after a point you know the mechanics behind them.

In some sense this predictability is because the designers possibly think it’s a good thing – Netflix, for example, tells you why it has recommended a particular video. The designers of these algorithms possibly think that explaining their decisions might given their human customers more reason to trust them.

(As an aside, it is common for people to rant against the “opaque” algorithms that drive systems as diverse as Facebook’s News Feed and Uber’s Surge Pricing. So perhaps some algorithm designers do see reason in wanting to explain themselves).

The way I see it, though, by attempting to explain themselves these algorithms are giving themselves away, and willingly failing the Turing test. Whenever recommendations sound purely mechanical, there is reason for people to start trusting them less. And when equally mechanical reasons are given for these mechanical recommendations, the desire to trust the recommendations falls further.

If I were to design a recommendation system, I’d introduce some irrationality, some hard-to-determine randomness to try make the customer believe that there is actually a person behind them. I believe it is a necessary condition for recommendations to become truly personalised!

Python and Hindi

So I’ve recently discovered that using Python to analyse data is, to me, like talking in Hindi. Let me explain.

Back in 2008-9 I lived in Delhi, where the only language spoken was Hindi. Now, while I’ve learnt Hindi formally in school (I got 90 out of 100 in my 10th boards!), and watched plenty of Hindi movies, I’ve never been particularly fluent in the language.

The basic problem is that I don’t know the language well enough to think in it. So when I’m talking Hindi, I usually think in Kannada and then translate my thoughts. This means my speech is slow – even Atal Behari Vajpayee can speak Hindi faster than me.

More importantly, thinking in Kannada and translating means that I can get several idioms wrong (can’t think of particular examples now). And I end up using the language in ways that native speakers don’t (again can’t think of examples here).

I recently realised it’s the same with programming languages. For some 7 years now I’ve mostly used R for data analysis, and have grown super comfortable with it. However, at work nowadays I’m required to use Python for my analysis, to ensure consistency with the rest of the firm.

While I’ve grown reasonably comfortable with using Python over the last few months, I realise that I have the same Hindi problem. I simply can’t think in Python. Any analysis I need to do, I think about it in R terms, and then mentally translate the code before performing it in Python.

This results in several inefficiencies. Firstly, the two languages are constructed differently and optimised for different things. When I think in one language and mentally translate the code to the other, I’m exploiting the efficiencies of the thinking language rather than the efficiencies of the coding language.

Then, the translation process itself can be ugly. What might be one line of code in R can sometimes take 15 lines in Python (and vice versa). So I end up writing insanely verbose code that is hard to read.

Such code also looks ugly – a “native user” of the language finds it rather funnily written, and will find it hard to read.

A decade ago, after a year of struggling in Delhi, I packed my bags and moved back to Bangalore, where I could both think and speak in Kannada. Wonder what this implies in a programming context!