AlphaZero Revisited

It’s been over a year since Google’s DeepMind first made its splash with the reinforcement-learning based chess playing engine AlphaZero. The first anniversary of the story of AlphaZero being released also coincided with the publication of the peer-reviewed paper.

To go with the peer-reviewed paper, DeepMind has released a further 200 games played between AlphaZero and the conventional chess engine StockFish, which is again heavily loaded in favour of wins for AlphaZero, but also contains 6 game where AlphaZero lost. I’ve been following these games on GM Daniel King’s excellent Powerplaychess channel, and want to revise my opinion on AlphaZero.

Back then, I had looked at AlphaZero’s play from my favourite studs and fighter framework, which in hindsight doesn’t do full justice to AlphaZero. From the games that I’ve seen from the set released this season, AlphaZero’s play hasn’t exactly been “stud”. It’s just that it’s much more “human”. And the reason why AlphaZero’s play possibly seems more human is because of the way it “learns”.

Conventional chess engines evaluate a position by considering all possible paths (ok not really, they use an intelligent method called Alpha-Beta Pruning to limit their search size), and then play the move that leads to the best position at the end of the search. These engines use “pre-learnt human concepts” such as point count for different pieces, which are used to evaluate positions. And this leads to a certain kind of play.

AlphaZero’s learning, process, however, involves playing zillions of games against itself (since I wrote that previous post, I’ve come back up to speed with reinforcement learning). And then based on the results of these games, it evaluates positions it reached in the course of play (in hindsight). On top of this, it builds a deep learning model to identify the goodness of positions.

Given my limited knowledge of how deep learning works, this process involves AlphaZero learning about “features” of games that have more often than not enabled it to win. So somewhere in the network there will be a node that represents “control of centre”. Another node deep in the network might represent “safety of king”. Yet another might perhaps involve “open A file”.

Of course, none of these features have been pre-specified to AlphaZero. It has simply learnt it by training its neural network on zillions of games it has played against itself. And while deep learning is hard to “explain”, it is likely to have so happened that the features of the game that AlphaZero has learnt are remarkably similar to the “features” of the game that human players have learnt over the centuries. And it is because of the commonality in these features that we find AlphaZero’s play so “human”.

Another way to look at is from the concept of “10000 hours” that Malcolm Gladwell spoke about in his book Outliers. As I had written in my review of the book, the concept of 10000 hours can be thought of as “putting fight until you get enough intuition to become stud”. AlphaZero, thanks to its large number of processors, has effectively spent much more than “10000 hours” playing against itself, with its neural network constantly “learning” from the positions faced and the outcomes of the game reached. And this way, it has “gained intuition” over features of the game that lead to wins, giving it an air of “studness”.

The interesting thing to me about AlphaZero’s play is that thanks to its “independent development” (in a way like the Finches of Galapagos), it has not been burdened by human intuition on what is good or bad, and learnt its own heuristics. And along the way, it has come up with a bunch of heuristics that have not commonly be used by human players.

Keeping bishops on the back rank (once the rooks have been connected), for example. A stronger preference for bishops to knights than humans. Suddenly simplifying from a terrifying-looking attack into a winning endgame (machines are generally good at endgames, so this is not that surprising). Temporary pawn and piece sacrifices. And all that.

Thanks to engines such as LeelaZero, we can soon see the results of these learnings being applied to human chess as well. And human chess can only become better!

Hypothesis Testing in Monte Carlo

I find it incredible, and not in a good way, that I took fourteen years to make the connection between two concepts I learnt barely a year apart.

In August-September 2003, I was auditing an advanced (graduate) course on Advanced Algorithms, where we learnt about randomised algorithms (I soon stopped auditing since the maths got heavy). And one important class of randomised algorithms is what is known as “Monte Carlo Algorithms”. Not to be confused with Monte Carlo Simulations, these are randomised algorithms that give a one way result. So, using the most prominent example of such an algorithm, you can ask “is this number prime?” and the answer to that can be either “maybe” or “no”.

The randomised algorithm can never conclusively answer “yes” to the primality question. If the algorithm can find a prime factor of the number, it answers “no” (this is conclusive). Otherwise it returns “maybe”. So the way you “conclude” that a number is prime is by running the test a large number of times. Each run reduces the probability that it is a “no” (since they’re all independent evaluations of “maybe”), and when the probability of “no” is low enough, you “think” it’s a “yes”. You might like this old post of mine regarding Monte Carlo algorithms in the context of romantic relationships.

Less than a year later, in July 2004, as part of a basic course in statistics, I learnt about hypothesis testing. Now (I’m kicking myself for failing to see the similarity then), the main principle of hypothesis testing is that you can never “accept a hypothesis”. You either reject a hypothesis or “fail to reject” it.  And if you fail to reject a hypothesis with a certain high probability (basically with more data, which implies more independent evaluations that don’t say “reject”), you will start thinking about “accept”.

Basically hypothesis testing is a one-sided  test, where you are trying to reject a hypothesis. And not being able to reject a hypothesis doesn’t mean we necessarily accept it – there is still the chance of going wrong if we were to accept it (this is where we get into messy territory such as p-values). And this is exactly like Monte Carlo algorithms – one-sided algorithms where we can only conclusively take a decision one way.

So I was thinking of these concepts when I came across this headline in ESPNCricinfo yesterday that said “Rahul Johri not found guilty” (not linking since Cricinfo has since changed the headline). The choice, or rather ordering, of words was interesting. “Not found guilty”, it said, rather than the usual “found not guilty”.

This is again a concept of one-sided testing. An investigation can either find someone guilty or it fails to do so, and the heading in this case suggested that the latter had happened. And as a deliberate choice, it became apparent why the headline was constructed this way – later it emerged that the decision to clear Rahul Johri of sexual harassment charges was a contentious one.

In most cases, when someone is “found not guilty” following an investigation, it usually suggests that the evidence on hand was enough to say that the chance of the person being guilty was rather low. The phrase “not found guilty”, on the other hand, says that one test failed to reject the hypothesis, but it didn’t have sufficient confidence to clear the person of guilt.

So due credit to the Cricinfo copywriters, and due debit to the product managers for later changing the headline rather than putting a fresh follow-up piece.

PS: The discussion following my tweet on the topic threw up one very interesting insight – such as Scotland having had a “not proven” verdict in the past for such cases (you can trust DD for coming up with such gems).

Human, Animal and Machine Intelligence

Earlier this week I started watching this series on Netflix called “Terrorism Close Calls“. Each episode is about an instance of attempted terrorism that has been foiled in the last 2 decades. For example, there is one example of the plot to bomb a set of transatlantic flights from London to North America in 2006 (a consequence of which is that liquids still aren’t allowed on board flights).

So the first episode of the series involves this Afghani guy who drives all the way from Colorado to New York to place a series of bombs in the latter’s subways (metro train system). He is under surveillance through the length of his journey, and just as he is about to enter New York, he is stopped for what seems like a “routine drugs test”.

As the episode explains, “a set of dogs went around his car sniffing”, but “rather than being trained to sniff drugs” (as is routine in such a stop), “these dogs had been trained to sniff explosives”.

This little snippet got me thinking about how machines are “trained” to “learn”. At the most basic level, machine learning involves showing a large number of “positive cases” and “negative cases” based on which the program “learns” the differences between the positive and negative cases, and thus to identify the positive cases.

So if you want to built a system to identify cats in an image, you feed the machine a large number of images with cats in them, and a large(r) number of images without cats in them, each appropriately “labelled” (“cat” or “no cat”) and based on the differences, the system learns to identify cats.

Similarly, if you want to teach a system to detect cancers based on MRIs, you show it a set of MRIs that show malignant tumours, and another set of MRIs without malignant tumours, and sure enough the machine learns to distinguish between the two sets (you might have come across claims of “AI can cure cancer”. This is how it does it).

However, AI can sometimes go wrong by learning the wrong things. For example, an algorithm trained to recognise sheep started classifying grass as “sheep” (since most of the positive training samples had sheep in meadows). Another system went crazy in its labelling when an unexpected object (an elephant in a drawing room) was present in the picture.

While machines learn through lots of positive and negative examples, that is not how humans learn, as I’ve been observing as my daughter grows up. When she was very little, we got her a book with one photo each of 100 different animals. And we would sit with her every day pointing at each picture and telling her what each was.

Soon enough, she could recognise cats and dogs and elephants and tigers. All by means of being “trained on” one image of each such animal. Soon enough, she could recognise hitherto unseen pictures of cats and dogs (and elephants and tigers). And then recognise dogs (as dogs) as they passed her on the street. What absolutely astounded me was that she managed to correctly recognise a cartoon cat, when all she had seen thus far were “real cats”.

So where do animals stand, in this spectrum of human to machine learning? Do they recognise from positive examples only (like humans do)? Or do they learn from a combination of positive and negative examples (like machines)? One thing that limits the positive-only learning for animals is the limited range of their communication.

What drives my curiosity is that they get trained for specific things – that you have dogs to identify drugs and dogs to identify explosives. You don’t usually have dogs that can recognise both (specialisation is for insects, as they say – or maybe it’s for all non-human animals).

My suspicion (having never had a pet) is that the way animals learn is closer to how humans learn – based on a large number of positive examples, rather than as the difference between positive and negative examples. Just that the limit of the animal’s communication being limited means that it is hard to train them for more than one thing (or maybe there’s something to do with their mental bandwidth as well. I don’t know).

What do you think? Interestingly enough, there is a recent paper that talks about how many machine learning systems have “animal-like abilities” rather than coming close to human intelligence.

For millions of years, mankind lived, just like the animals.
And then something happened that unleashed the power of our imagination. We learned to talk
– Stephen Hawking, in the opening of a Roger Waters-less Pink Floyd’s Keep Talking

Programming Languages

I take this opportunity to apologise for my prior belief that all that matters is thinking algorithmically, and language in which the ideas are expressed doesn’t matter.

About a decade ago, I used to make fun of information technology company that hired developers based on the language they coded in. My contention was that writing code is a skill that you either have or you don’t, and what a potential employer needs to look for is the ability to think algorithmically, and then render ideas in code. 

While I’ve never worked as a software engineer I find myself writing more and more code over the years as a part of doing data analysis. The primary tool I use is R, where coding doesn’t really feel like coding, since it is a rather high level language. However, I’m occasionally asked to show code in Python, since some clients are more proficient in that, and the one thing that has done is to teach me the value of domain knowledge of a programming language. 

I take this opportunity to apologise for my prior belief that all that matters is thinking algorithmically, and language in which the ideas are expressed doesn’t matter. 

This is because the language you usually program in subtly nudges you towards thinking in a particular way. Having mostly used R over the last decade, I think in terms of tables and data frames, and after having learnt tidyverse earlier this year, my way of thinking algorithmically has become in a weird way “object oriented” (no, this has nothing to do with classes). I take an “object” (a data frame) and then manipulate it in various ways, changing it, summarising stuff, calculating things on the fly and aggregating, until the point where the result comes out in an elegant manner. 

And while Pandas allows chaining (in fact, it is from Pandas that I suspect the tidyverse guys got the idea for the “%>%” chaining operator), it is by no means as complete in its treatment of chaining as R, and that that makes things tricky. 

Moreover, being proficient in R makes you think in terms of vectorised operations, and when you see that python doesn’t necessarily offer that, and and operations that were once simple in R are now rather complicated in Python, using list comprehension and what not. 

Putting it another way, thinking algorithmically in the framework offered by one programming language makes it rather stressful to express these thoughts in another language where the way of algorithmic thinking is rather different. 

For example, I’ve never got the point of the index in pandas dataframes, and I only find myself “resetting” it constantly so that my way of addressing isn’t mangled. Compared to the intuitive syntax in R, which is first and foremost a data analysis tool, and where the data frame is “native”, the programming language approach of python with its locs and ilocs is again irritating. 

I can go on… 

And I’m guessing this feeling is mutual – someone used to doing things the python way would find R’s syntax and way of doing things rather irritating. R’s machine learning toolkit for example is nowhere as easy as scikit learn is in python (this doesn’t affect me since I seldom need to use machine learning. For example, I use regression less than 5% of the time in my work). 

The next time I see a job opening for a “java developer” I will not laugh like I used to ten years ago. I know that this posting is looking for a developer who can not only think algorithmically, but also algorithmically in the way that is most convenient to express in Java. And unlearning one way of algorithmic thinking and learning another isn’t particularly easy. 

Statistics and machine learning

So a group of statisticians (from Cyprus and Greece) have written an easy-to-read paper comparing statistical and machine learning methods in time series forecasting, and found that statistical methods do better, both in terms of accuracy and computational complexity.

To me, there’s no surprise in the conclusion, since in the statistical methods, there is some human intelligence involved, in terms of removing seasonality, making the time series stationary and then using statistical methods that have been built specifically for time series forecasting (including some incredibly simple stuff like exponential smoothing).

Machine learning methods, on the other hand, are more general purpose – the same neural networks used for forecasting these time series, with changed parameters, can be used for predicting something else.

In a way, using machine learning for time series forecasting is like using that little screwdriver from a Swiss army knife, rather than a proper screwdriver. Yes, it might do the job, but it’s in general inefficient and not an effective use of resources.

Yet, it is important that this paper has been written since the trend in industry nowadays has been that given cheap computing power, machine learning be used for pretty much any problem, irrespective of whether it is the most appropriate method for doing so. You also see the rise of “machine learning purists” who insist that no human intelligence should “contaminate” these models, and machines should do everything.

By pointing out that statistical techniques are superior at time series forecasting compared to general machine learning techniques, the authors bring to attention that using purpose-built techniques can actually do much better, and that we can build better systems by using a combination of human and machine intelligence.

They also helpfully include this nice picture that summarises what machine learning is good for, and I wholeheartedly agree: 

The paper also has some other gems. A few samples here:

Knowing that a certain sophisticated method is not as accurate as a much simpler one is upsetting from a scientific point of view as the former requires a great deal of academic expertise and ample computer time to be applied.

 

[…] the post-sample predictions of simple statistical methods were found to be at least as accurate as the sophisticated ones. This finding was furiously objected to by theoretical statisticians [76], who claimed that a simple method being a special case of e.g. ARIMA models, could not be more accurate than the ARIMA one, refusing to accept the empirical evidence proving the opposite.

 

A problem with the academic ML forecasting literature is that the majority of published studies provide forecasts and claim satisfactory accuracies without comparing them with simple statistical methods or even naive benchmarks. Doing so raises expectations that ML methods provide accurate predictions, but without any empirical proof that this is the case.

 

At present, the issue of uncertainty has not been included in the research agenda of the ML field, leaving a huge vacuum that must be filled as estimating the uncertainty in future predictions is as important as the forecasts themselves.

Beer and diapers: Netflix edition

When we started using Netflix last May, we created three personas for the three of us in the family – “Karthik”, “Priyanka” and “Berry”. At that time we didn’t realise that there was already a pre-created “kids” (subsequently renamed “children” – don’t know why that happened) persona there.

So while Priyanka and I mostly use our respective personas to consume Netflix (our interests in terms of video content hardly intersect), Berry uses both her profile and the kids profile for her stuff (of course, she’s too young to put it on herself. We do it for her). So over the year, the “Berry” profile has been mostly used to play Peppa Pig, and the occasional wildlife documentary.

Which is why we were shocked the other day to find that “Real life wife swap” had been recommended on her account. Yes, you read that right. We muttered a word of abuse about Netflix’s machine learning algorithms and since then have only used the “kids” profile to play Berry’s stuff.

Since then I’ve been wondering what made Netflix recommend “real life wife swap” to Berry. Surely, it would have been clear to Netflix that while it wasn’t officially classified as one, the Berry persona was a kid’s account? And even if it didn’t, didn’t the fact that the account was used for watching kids’ stuff lead the collaborative filtering algorithms at Netflix to recommend more kids’ stuff? I’ve come up with various hypotheses.

Since I’m not Netflix, and I don’t have their data, I can’t test it, but my favourite hypothesis so far involves what is possibly the most commonly cited example in retail analytics – “beer and diapers“. In this most-likely-apocryphal story, a supermarket chain discovered that beer and diapers were highly likely to appear together in shopping baskets. Correlation led to causation and a hypothesis was made that this was the result of tired fathers buying beer on their diaper shopping trips.

So the Netflix version of beer-and-diapers, which is my hypothesis, goes like this. Harrowed parents are pestered by their kids to play Peppa Pig and other kiddie stuff. The parents are so stressed that they don’t switch to the kid’s persona, and instead play Peppa Pig or whatever from their own accounts. The kid is happy and soon goes to bed. And then the parent decides to unwind by watching some raunchy stuff like “real life wife swap”.

Repeat this story in enough families, and you have a strong enough pattern that accounts not explicitly classified as “kids/children” have strong activity of both kiddie stuff and adult content. And when you use an account not explicitly mentioned as “kids” to watch kiddie stuff, it gets matched to these accounts that have created the pattern – Netflix effectively assumes that watching kid stuff on an adult account indicates that the same account is used to watch adult content as well. And so serves it to Berry!

Machine learning algorithms basically work on identifying patterns in data, and then fitting these patterns on hitherto unseen data. Sometimes the patterns make sense – like Google Photos identifying you even in your kiddie pics. Other times, the patterns are offensive – like the time Google Photos classified a black woman as a “gorilla“.

Thus what is necessary is some level of human oversight, to make sure that the patterns the machine has identified makes some sort of sense (machine learning purists say this is against the spirit of machine learning, since one of the purposes of machine learning is to discover patterns not perceptible to humans).

That kind of oversight at Netflix would have suggested that you can’t tag a profile to a “kiddie content AND adult content” category if the profile has been used to watch ONLY kiddie content (or ONLY adult content). And that kind of oversight would have also led Netflix to investigate issues of users using “general” account for their kids, and coming up with an algorithm to classify such accounts as kids’ accounts, and serve only kids’ content there.

It seems, though, that algorithms run supreme at Netflix, and so my baby daughter gets served “real life wife swap”. Again, this is all a hypothesis (real life wife swap being recommended is a fact, of course)!

Algorithmic curation

When I got my first smartphone (a Samsung Galaxy Note 2) in 2013, one of the first apps I installed on it was Flipboard. I’d seen the app while checking out some phones at either the Apple or Samsung retail outlets close to my home, and it seemed like a rather interesting idea.

For a long time, Flipboard was my go-to app to check the day’s news, as it conveniently categorised news into “tech”, “business” and “sport” and learnt about my preferences and fed me stuff I wanted. And then after some update, it suddenly stopped working – somehow it started serving too much stuff I didn’t want to read about, and when I tuned (by “following” and “unfollowing” topics) my feed, it progressively got worse.

I stopped using it some 2 years back, but out of curiosity started using it again recently. While it did throw up some nice articles, there is too much unwanted stuff in the app. More precisely, there’s a lot of “clickbaity” stuff (“10 things about Narendra Modi you would never want to know” and the like) in my feed, meaning I have to wade through a lot of such articles to find the occasional good ones.

(Aside: I dedicate about half a chapter to this phenomenon in my book. The technical term is “congestion”. I talk about it in the context of markets in relationships and real estate)

Flipboard is not the only one. I use this app called Pocket to bookmark long articles and read later. A couple of years back, Pocket started giving “recommendations” based on what I’d read and liked. Initially it was good, and mostly curated from what my “friends” on Pocket recommended. Now, increasingly I’m getting clickbaity stuff again.

I stopped using Facebook a long time before they recently redesigned their newsfeed (to give more weight to friends’ stuff than third party news), but I suspect that one of the reasons they made the change was the same – the feed was getting overwhelmed with clickbaity stuff, which people liked but didn’t really read.

Basically, there seems to be a widespread problem in a lot of automatically curated news feeds. To put it another way, the clickbaity websites seem to have done too well in terms of gaming whatever algorithms the likes of Facebook, Flipboard and Pocket use to build their automated recommendations.

And more worryingly, with all these curators starting to do badly around the same time (ok this is my empirical observation. Given few data points I might be wrong), it suggests that all automated curation algorithms use a very similar algorithm! And that can’t be a good thing.