## Human, Animal and Machine Intelligence

Earlier this week I started watching this series on Netflix called “Terrorism Close Calls“. Each episode is about an instance of attempted terrorism that has been foiled in the last 2 decades. For example, there is one example of the plot to bomb a set of transatlantic flights from London to North America in 2006 (a consequence of which is that liquids still aren’t allowed on board flights).

So the first episode of the series involves this Afghani guy who drives all the way from Colorado to New York to place a series of bombs in the latter’s subways (metro train system). He is under surveillance through the length of his journey, and just as he is about to enter New York, he is stopped for what seems like a “routine drugs test”.

As the episode explains, “a set of dogs went around his car sniffing”, but “rather than being trained to sniff drugs” (as is routine in such a stop), “these dogs had been trained to sniff explosives”.

This little snippet got me thinking about how machines are “trained” to “learn”. At the most basic level, machine learning involves showing a large number of “positive cases” and “negative cases” based on which the program “learns” the differences between the positive and negative cases, and thus to identify the positive cases.

So if you want to built a system to identify cats in an image, you feed the machine a large number of images with cats in them, and a large(r) number of images without cats in them, each appropriately “labelled” (“cat” or “no cat”) and based on the differences, the system learns to identify cats.

Similarly, if you want to teach a system to detect cancers based on MRIs, you show it a set of MRIs that show malignant tumours, and another set of MRIs without malignant tumours, and sure enough the machine learns to distinguish between the two sets (you might have come across claims of “AI can cure cancer”. This is how it does it).

However, AI can sometimes go wrong by learning the wrong things. For example, an algorithm trained to recognise sheep started classifying grass as “sheep” (since most of the positive training samples had sheep in meadows). Another system went crazy in its labelling when an unexpected object (an elephant in a drawing room) was present in the picture.

While machines learn through lots of positive and negative examples, that is not how humans learn, as I’ve been observing as my daughter grows up. When she was very little, we got her a book with one photo each of 100 different animals. And we would sit with her every day pointing at each picture and telling her what each was.

Soon enough, she could recognise cats and dogs and elephants and tigers. All by means of being “trained on” one image of each such animal. Soon enough, she could recognise hitherto unseen pictures of cats and dogs (and elephants and tigers). And then recognise dogs (as dogs) as they passed her on the street. What absolutely astounded me was that she managed to correctly recognise a cartoon cat, when all she had seen thus far were “real cats”.

So where do animals stand, in this spectrum of human to machine learning? Do they recognise from positive examples only (like humans do)? Or do they learn from a combination of positive and negative examples (like machines)? One thing that limits the positive-only learning for animals is the limited range of their communication.

What drives my curiosity is that they get trained for specific things – that you have dogs to identify drugs and dogs to identify explosives. You don’t usually have dogs that can recognise both (specialisation is for insects, as they say – or maybe it’s for all non-human animals).

My suspicion (having never had a pet) is that the way animals learn is closer to how humans learn – based on a large number of positive examples, rather than as the difference between positive and negative examples. Just that the limit of the animal’s communication being limited means that it is hard to train them for more than one thing (or maybe there’s something to do with their mental bandwidth as well. I don’t know).

What do you think? Interestingly enough, there is a recent paper that talks about how many machine learning systems have “animal-like abilities” rather than coming close to human intelligence.

For millions of years, mankind lived, just like the animals.
And then something happened that unleashed the power of our imagination. We learned to talk
– Stephen Hawking, in the opening of a Roger Waters-less Pink Floyd’s Keep Talking

## Programming Languages

I take this opportunity to apologise for my prior belief that all that matters is thinking algorithmically, and language in which the ideas are expressed doesn’t matter.

About a decade ago, I used to make fun of information technology company that hired developers based on the language they coded in. My contention was that writing code is a skill that you either have or you don’t, and what a potential employer needs to look for is the ability to think algorithmically, and then render ideas in code.

While I’ve never worked as a software engineer I find myself writing more and more code over the years as a part of doing data analysis. The primary tool I use is R, where coding doesn’t really feel like coding, since it is a rather high level language. However, I’m occasionally asked to show code in Python, since some clients are more proficient in that, and the one thing that has done is to teach me the value of domain knowledge of a programming language.

I take this opportunity to apologise for my prior belief that all that matters is thinking algorithmically, and language in which the ideas are expressed doesn’t matter.

This is because the language you usually program in subtly nudges you towards thinking in a particular way. Having mostly used R over the last decade, I think in terms of tables and data frames, and after having learnt tidyverse earlier this year, my way of thinking algorithmically has become in a weird way “object oriented” (no, this has nothing to do with classes). I take an “object” (a data frame) and then manipulate it in various ways, changing it, summarising stuff, calculating things on the fly and aggregating, until the point where the result comes out in an elegant manner.

And while Pandas allows chaining (in fact, it is from Pandas that I suspect the tidyverse guys got the idea for the “%>%” chaining operator), it is by no means as complete in its treatment of chaining as R, and that that makes things tricky.

Moreover, being proficient in R makes you think in terms of vectorised operations, and when you see that python doesn’t necessarily offer that, and and operations that were once simple in R are now rather complicated in Python, using list comprehension and what not.

Putting it another way, thinking algorithmically in the framework offered by one programming language makes it rather stressful to express these thoughts in another language where the way of algorithmic thinking is rather different.

For example, I’ve never got the point of the index in pandas dataframes, and I only find myself “resetting” it constantly so that my way of addressing isn’t mangled. Compared to the intuitive syntax in R, which is first and foremost a data analysis tool, and where the data frame is “native”, the programming language approach of python with its locs and ilocs is again irritating.

I can go on…

And I’m guessing this feeling is mutual – someone used to doing things the python way would find R’s syntax and way of doing things rather irritating. R’s machine learning toolkit for example is nowhere as easy as scikit learn is in python (this doesn’t affect me since I seldom need to use machine learning. For example, I use regression less than 5% of the time in my work).

The next time I see a job opening for a “java developer” I will not laugh like I used to ten years ago. I know that this posting is looking for a developer who can not only think algorithmically, but also algorithmically in the way that is most convenient to express in Java. And unlearning one way of algorithmic thinking and learning another isn’t particularly easy.

## Statistics and machine learning

So a group of statisticians (from Cyprus and Greece) have written an easy-to-read paper comparing statistical and machine learning methods in time series forecasting, and found that statistical methods do better, both in terms of accuracy and computational complexity.

To me, there’s no surprise in the conclusion, since in the statistical methods, there is some human intelligence involved, in terms of removing seasonality, making the time series stationary and then using statistical methods that have been built specifically for time series forecasting (including some incredibly simple stuff like exponential smoothing).

Machine learning methods, on the other hand, are more general purpose – the same neural networks used for forecasting these time series, with changed parameters, can be used for predicting something else.

In a way, using machine learning for time series forecasting is like using that little screwdriver from a Swiss army knife, rather than a proper screwdriver. Yes, it might do the job, but it’s in general inefficient and not an effective use of resources.

Yet, it is important that this paper has been written since the trend in industry nowadays has been that given cheap computing power, machine learning be used for pretty much any problem, irrespective of whether it is the most appropriate method for doing so. You also see the rise of “machine learning purists” who insist that no human intelligence should “contaminate” these models, and machines should do everything.

By pointing out that statistical techniques are superior at time series forecasting compared to general machine learning techniques, the authors bring to attention that using purpose-built techniques can actually do much better, and that we can build better systems by using a combination of human and machine intelligence.

They also helpfully include this nice picture that summarises what machine learning is good for, and I wholeheartedly agree:

The paper also has some other gems. A few samples here:

Knowing that a certain sophisticated method is not as accurate as a much simpler one is upsetting from a scientific point of view as the former requires a great deal of academic expertise and ample computer time to be applied.

[…] the post-sample predictions of simple statistical methods were found to be at least as accurate as the sophisticated ones. This finding was furiously objected to by theoretical statisticians [76], who claimed that a simple method being a special case of e.g. ARIMA models, could not be more accurate than the ARIMA one, refusing to accept the empirical evidence proving the opposite.

A problem with the academic ML forecasting literature is that the majority of published studies provide forecasts and claim satisfactory accuracies without comparing them with simple statistical methods or even naive benchmarks. Doing so raises expectations that ML methods provide accurate predictions, but without any empirical proof that this is the case.

At present, the issue of uncertainty has not been included in the research agenda of the ML field, leaving a huge vacuum that must be filled as estimating the uncertainty in future predictions is as important as the forecasts themselves.

## Beer and diapers: Netflix edition

When we started using Netflix last May, we created three personas for the three of us in the family – “Karthik”, “Priyanka” and “Berry”. At that time we didn’t realise that there was already a pre-created “kids” (subsequently renamed “children” – don’t know why that happened) persona there.

So while Priyanka and I mostly use our respective personas to consume Netflix (our interests in terms of video content hardly intersect), Berry uses both her profile and the kids profile for her stuff (of course, she’s too young to put it on herself. We do it for her). So over the year, the “Berry” profile has been mostly used to play Peppa Pig, and the occasional wildlife documentary.

Which is why we were shocked the other day to find that “Real life wife swap” had been recommended on her account. Yes, you read that right. We muttered a word of abuse about Netflix’s machine learning algorithms and since then have only used the “kids” profile to play Berry’s stuff.

Since then I’ve been wondering what made Netflix recommend “real life wife swap” to Berry. Surely, it would have been clear to Netflix that while it wasn’t officially classified as one, the Berry persona was a kid’s account? And even if it didn’t, didn’t the fact that the account was used for watching kids’ stuff lead the collaborative filtering algorithms at Netflix to recommend more kids’ stuff? I’ve come up with various hypotheses.

Since I’m not Netflix, and I don’t have their data, I can’t test it, but my favourite hypothesis so far involves what is possibly the most commonly cited example in retail analytics – “beer and diapers“. In this most-likely-apocryphal story, a supermarket chain discovered that beer and diapers were highly likely to appear together in shopping baskets. Correlation led to causation and a hypothesis was made that this was the result of tired fathers buying beer on their diaper shopping trips.

So the Netflix version of beer-and-diapers, which is my hypothesis, goes like this. Harrowed parents are pestered by their kids to play Peppa Pig and other kiddie stuff. The parents are so stressed that they don’t switch to the kid’s persona, and instead play Peppa Pig or whatever from their own accounts. The kid is happy and soon goes to bed. And then the parent decides to unwind by watching some raunchy stuff like “real life wife swap”.

Repeat this story in enough families, and you have a strong enough pattern that accounts not explicitly classified as “kids/children” have strong activity of both kiddie stuff and adult content. And when you use an account not explicitly mentioned as “kids” to watch kiddie stuff, it gets matched to these accounts that have created the pattern – Netflix effectively assumes that watching kid stuff on an adult account indicates that the same account is used to watch adult content as well. And so serves it to Berry!

Machine learning algorithms basically work on identifying patterns in data, and then fitting these patterns on hitherto unseen data. Sometimes the patterns make sense – like Google Photos identifying you even in your kiddie pics. Other times, the patterns are offensive – like the time Google Photos classified a black woman as a “gorilla“.

Thus what is necessary is some level of human oversight, to make sure that the patterns the machine has identified makes some sort of sense (machine learning purists say this is against the spirit of machine learning, since one of the purposes of machine learning is to discover patterns not perceptible to humans).

That kind of oversight at Netflix would have suggested that you can’t tag a profile to a “kiddie content AND adult content” category if the profile has been used to watch ONLY kiddie content (or ONLY adult content). And that kind of oversight would have also led Netflix to investigate issues of users using “general” account for their kids, and coming up with an algorithm to classify such accounts as kids’ accounts, and serve only kids’ content there.

It seems, though, that algorithms run supreme at Netflix, and so my baby daughter gets served “real life wife swap”. Again, this is all a hypothesis (real life wife swap being recommended is a fact, of course)!

## Algorithmic curation

When I got my first smartphone (a Samsung Galaxy Note 2) in 2013, one of the first apps I installed on it was Flipboard. I’d seen the app while checking out some phones at either the Apple or Samsung retail outlets close to my home, and it seemed like a rather interesting idea.

For a long time, Flipboard was my go-to app to check the day’s news, as it conveniently categorised news into “tech”, “business” and “sport” and learnt about my preferences and fed me stuff I wanted. And then after some update, it suddenly stopped working – somehow it started serving too much stuff I didn’t want to read about, and when I tuned (by “following” and “unfollowing” topics) my feed, it progressively got worse.

I stopped using it some 2 years back, but out of curiosity started using it again recently. While it did throw up some nice articles, there is too much unwanted stuff in the app. More precisely, there’s a lot of “clickbaity” stuff (“10 things about Narendra Modi you would never want to know” and the like) in my feed, meaning I have to wade through a lot of such articles to find the occasional good ones.

(Aside: I dedicate about half a chapter to this phenomenon in my book. The technical term is “congestion”. I talk about it in the context of markets in relationships and real estate)

Flipboard is not the only one. I use this app called Pocket to bookmark long articles and read later. A couple of years back, Pocket started giving “recommendations” based on what I’d read and liked. Initially it was good, and mostly curated from what my “friends” on Pocket recommended. Now, increasingly I’m getting clickbaity stuff again.

I stopped using Facebook a long time before they recently redesigned their newsfeed (to give more weight to friends’ stuff than third party news), but I suspect that one of the reasons they made the change was the same – the feed was getting overwhelmed with clickbaity stuff, which people liked but didn’t really read.

Basically, there seems to be a widespread problem in a lot of automatically curated news feeds. To put it another way, the clickbaity websites seem to have done too well in terms of gaming whatever algorithms the likes of Facebook, Flipboard and Pocket use to build their automated recommendations.

And more worryingly, with all these curators starting to do badly around the same time (ok this is my empirical observation. Given few data points I might be wrong), it suggests that all automated curation algorithms use a very similar algorithm! And that can’t be a good thing.

## Stirring the pile efficiently

Warning: This is a technical post, and involves some code, etc.

As I’ve ranted a fair bit on this blog over the last year, a lot of “machine learning” in the industry can be described as “stirring the pile”. Regular readers of this blog will be familiar with this image from XKCD by now:

Basically people simply take datasets and apply all the machine learning techniques they have heard of (implementation is damn easy – scikit learn allows you to implement just about any model in three similar looking lines of code; See my code here to see how similar the implementation is).

So I thought I’ll help these pile-stirrers by giving some hints of what method to use for different kinds of data. I’ve over-simplified stuff, and so assume that:

1. There are two predictor variables X and Y. The predicted variable “Z” is binary.
2. X and Y are each drawn from a standard normal distribution.
3. The predicted variable Z is “clean” – there is a region in the X-Y plane where Z is always “true” and another region where Z is always “false”
4. So the idea is to see which machine learning techniques are good at identifying which kind of geometrical figures.
5. Everything is done “in-sample”. Given the nature of the data, it doesn’t matter if we do it in-sample or out-of-sample.

For those that understand Python (and every pile-stirrer worth his salt is excellent at Python), I’ve put my code in a nice Jupyter Notebook, which can be found here.

So this is what the output looks like. The top row shows the “true values” of Z. Then we have a row for each of the techniques we’ve used, which shows how well these techniques can identify the pattern given in the top row (click on the image for full size).

As you can see, I’ve chosen some common geometrical shapes and seen which methods are good at identifying those. A few pertinent observations:

1. Logistic regression and linear SVM are broadly similar, and both are shit for this kind of dataset. Being linear models, they fail to deal with non-linear patterns
2. SVM with RBF kernel is better, but it fails when there are multiple “true regions” in the dataset. At least it’s good at figuring out some non-linear patterns. However, it can’t figure out the triangle or square – it draws curves around them, instead.
3. Naive Bayesian (I’ve never understood this even though I’m pretty good at Bayesian statistics, but I understand this is a commonly used technique; and I’ve used default parameters so not sure how it is “Bayesian” even) can identify some stuff but does badly when there are disjoint regions where Z is true.
4. Ensemble methods such as Random Forests and Gradient Boosting do rather well on all the given inputs. They do well for both polygons and curves. Elsewhere, Ada Boost mostly does well but trips up on the hyperbola.
5. For some reason, Lasso fails to give an output (in the true spirit of pile-stirring, I didn’t explore why). Ridge is again a regression method and so does badly on this non-linear dataset
6. Neural Networks (Multi Layer Perceptron to be precise) does reasonably well, but can’t figure out the sharp edges of the polygons.
7. Decision trees again do rather well. I’m pleasantly surprised that they pick up and classify the disjoint sets (multi-circle and hyperbola) correctly. Maybe it’s the way scikit learn implements them?

Of course, the datasets that one comes across in real life are never such simple geometrical figures, but I hope that this set can give you some idea on what techniques to use where.

At least I hope that this makes you think about the suitability of different techniques for the data rather than simply applying all the techniques you know and then picking the one that performs best on your given training and test data.

That would count as nothing different from p-hacking, and there’s an XKCD for that as well!

## Machine learning and degrees of freedom

For starters, machine learning is not magic. It might appear like magic when you see Google Photos automatically tagging all your family members correctly, down to the day of their birth. It might appear so when Siri or Alexa give a perfect response to your request. And the way AlphaZero plays chess is almost human!

But no, machine learning is not magic. I’d made a detailed argument about that in the second edition of my newsletter (subscribe if you haven’t already!).

One way to think of it is that the output of a machine learning model (which could be anything from “does this picture contain a cat?” to “is the speaker speaking in English?”) is the result of a mathematical formula, whose parameters are unknown at the beginning of the exercise.

As the system gets “trained” (of late I’ve avoided using the word “training” in the context of machine learning, preferring to use “calibration” instead. But anyway…), the hitherto unknown parameters of the formula get adjusted in a manner that the formula output matches the given data. Once the system has “seen” enough data, we have a model, which can then be applied on unknown data (I’m completely simplifying it here).

The genius in machine learning comes in setting up mathematical formulae in a way that given input-output pairs of data can be used to adjust the parameters of the formulae. The genius in deep learning, which has been the rage this decade, for example, comes from a 30-year old mathematical breakthrough called “back propagation”. The reason it took until a few years back for it to become a “thing” has to do with data availability, and compute power (check this terrific piece in the MIT Tech Review about deep learning).

Within machine learning, the degree of complexity of a model can vary significantly. In an ordinary univariate least squares regression, for example, there are only two parameters the system can play with (slope and intercept of the regression line). Even a simple “shallow” neural network, on the other hand, has thousands of parameters.

Because a regression has so few parameters, the kind of patterns that the system can detect is rather limited (whatever you do, the system can only draw a line. Nothing more!). Thus, regression is applied only when you know that the relationship that exists is simple (and linear), or when you are trying to force-fit a linear model.

The upside of simple models such as regression is that because there are so few parameters to be adjusted, you need relatively few data points in order to adjust them to the required degree of accuracy.

As models get more and more complicated, the number of parameters increases, thus increasing the complexity of patterns that can be detected by the system. Close to one extreme, you have systems that see lots of current pictures of you and then identify you in your baby pictures.

Such complicated patterns can be identified because the system parameters have lots of degrees of freedom. The downside, of course, is that because the parameters start off having so much freedom, it takes that much more data to “tie them down”. The reason Google Photos can tag you in your baby pictures is partly down to the quantum of image data that Google has, which does an effective job of tying down the parameters. Google Translate similarly uses large repositories of multi-lingual text in order to “learn languages”.

Like most other things in life, machine learning also involves a tradeoff. It is possible for systems to identify complex patterns, but for that you need to start off with lots of “degrees of freedom”, and then use lots of data to tie down the variables. If your data is small, then you can only afford a small number of parameters, and that limits the complexity of patterns that can be detected.

One way around this, of course, is to use your own human intelligence as a pre-processing step in order to set up parameters in a way that they can be effectively tuned by data. Gopi had a nice post recently on “neat learning versus deep learning“, which is relevant in this context.

Finally, there is the issue of spurious correlations. Because machine learning systems are basically mathematical formulae designed to learn patterns from data, spurious correlations in the input dataset can lead to the system learning random things, which can hamper its predictive power.

Data sets, especially ones that have lots of dimensions, can display correlations that appear at random, but if the input dataset shows enough of these correlations, the system will “learn” them as a pattern, and try to use them in predictions. And the more complicated your model gets, the harder it is to know what it is doing, and thus the harder it is to identify these spurious correlations!

And the thing with having too many “free parameters” (lots of degrees of freedom but without enough data to tie down the parameters) is that these free parameters are especially susceptible to learning the spurious correlations – for they have no other job.

Thinking about it, after all, machine learning systems are not human!

## Nested Ternary Operators

It’s nearly twenty years since I first learnt to code, and the choice of language (imposed by school, among others) then was C. One of the most fascinating things about C was what was simply called the “ternary operator”, which is kinda similar to the IF statement in Excel, ifelse statement in R and np.where statement in Python.

Basically the ternary operator consisted of a ‘?’ and a ‘:’. It was a statement that took the form of “if this then that else something else”. So, for example, if you had two variables $a$ and $b$, and had to return the maximum of them, you could use the ternary operator to say $a>b?a:b$.

Soon I was attending programming contests, where there would be questions on debugging programs. These would inevitably contain one question on ternary operators. A few years later I started attending job interviews for software engineering positions. The ternary operator questions were still around, except that now it would be common to “nest” ternary operators (include one inside the other). It became a running joke that the only place you’d see nested ternary operators was in software engineering interviews.

The thing with the ternary operator is that while it allows you to write your program in fewer lines of code and make it seem more concise, it makes the code a lot less readable. This in turn makes it hard for people to understand your code, and thus makes it hard to debug. In that sense, using the operator while coding in C is not considered particularly good practice.

It’s nearly 2018 now, and C is not used that much nowadays, so the ternary operator, and the nested ternary operator, have made their exit – even from programming interviews if I’m not wrong. However, people still continue to maintain this practice of writing highly optimised code.

Now, every programmer who thinks he’s a good programmer likes to write efficient code. There’s this sense of elegance about code written in a rather elegant manner, using only a few lines. Sometimes such elegant code is also more efficient, speeding up computation and consuming less memory (think, for example, vectorised operations in R).

The problem, however, is that such elegance comes with a tradeoff with readability. The more optimised a piece of code is, the harder it is for someone else to understand it, and thus the harder it is to debug. And the more complicated the algorithm being coded, the worse it gets.

It makes me think that the reason all those ternary operators used to appear in those software engineering interviews (FYI I’ve never done a software engineering job) is to check if you’re able to read complicated code that others write!

## AlphaZero defeats Stockfish: Quick thoughts

The big news of the day, as far as I’m concerned, is the victory of Google Deepmind’s AlphaZero over Stockfish, currently the highest rated chess engine. This comes barely months after Deepmind’s AlphaGo Zero had bested the earlier avatar of AlphaGo in the game of Go.

Like its Go version, the AlphaZero chess playing machine learnt using reinforcement learning (I remember doing a term paper on the concept back in 2003 but have mostly forgotten). Basically it wasn’t given any “training data”, but the machine trained itself on continuously playing with itself, with feedback given in each stage of learning helping it learn better.

After only about four hours of “training” (basically playing against itself and discovering moves), AlphaZero managed to record this victory in a 100-game match, winning 28 and losing none (the rest of the games were drawn).

There’s a sample game here on the Chess.com website and while this might be a biased sample (it’s likely that the AlphaZero engineers included the most spectacular games in their paper, from which this is taken), the way AlphaZero plays is vastly different from the way engines such as Stockfish have been playing.

I’m not that much of a chess expert (I “retired” from my playing career back in 1994), but the striking things for me from this game were

• the move 7. d5 against the Queen’s Indian
• The piece sacrifice a few moves later that was hard to see
• AlphaZero’s consistent attempts until late in the game to avoid trading queens
• The move Qh1 somewhere in the middle of the game

In a way (and being consistent with some of the themes of this blog), AlphaZero can be described as a “stud” chess machine, having taught itself to play based on feedback from games it’s already played (the way reinforcement learning broadly works is that actions that led to “good rewards” are incentivised in the next iteration, while those that led to “poor rewards” are penalised. The challenge in this case is to set up chess in a way that is conducive for a reinforcement learning system).

Engines such as StockFish, on the other hand, are absolute “fighters”. They get their “power” by brute force, by going down nearly all possible paths in the game several moves down. This is supplemented by analysis of millions of existing games of various levels which the engine “learns” from – among other things, it learns how to prune and prioritise the paths it searches on. StockFish is also fed a database of chess openings which it remembers and tries to play.

What is interesting is that AlphaZero has “discovered” some popular chess openings through the course of is self-learning. It is interesting to note that some popular openings such as the King’s Indian or French find little favour with this engine, while others such as the Queen’s Gambit or the Queen’s Indian find favour. This is a very interesting development in terms of opening theory itself.

In any case, my immediate concern from this development is how it will affect human chess. Over the last decade or two, engines such as stockfish have played a profound role in the development of chess, with current top players such as Magnus Carlsen or Sergey Karjakin having trained extensively with these engines.

The way top grandmasters play has seen a steady change in these years as they have ingested the ideas from engines such as StockFish. The game has become far more quiet and positional, as players seek to gain small advantages which steadily improves over the course of (long) games. This is consistent with the way the engines that players learn from play.

Based on the evidence of the one game I’ve seen of AlphaZero, it plays very differently from the existing engines. Based on this, it will be interesting to see how human players who train with AlphaZero based engines (or their clones) will change their game.

Maybe chess will turn back to being a bit more tactical than it’s been in the last decade? It’s hard to say right now!

## Algorithms and the Turing Test

One massive concern about the rise of artificial intelligence and machine learning is the perpetuation of human biases. This could be racism (the story, possibly apocryphal, of a black person being tagged as a gorilla) or sexism (see tweet below) or any other forms of discrimination (objective looking data that actually represents certain divisions).

In other words, mainstream concern about artificial intelligence is that it is too human, and such systems should somehow be “cured” of their human biases in order to be fair.

My concern, though, is the opposite. That many of the artificial intelligence and machine learning systems are not “human enough”. In other words, that most present day artificial intelligence and machine learning systems would not pass the Turing Test.

To remind you of the test, here is an extract from Wikipedia:

The Turing test, developed by Alan Turing in 1950, is a test of a machine’s ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversationsbetween a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation is a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel such as a computer keyboard and screen so the result would not depend on the machine’s ability to render words as speech.[2] If the evaluator cannot reliably tell the machine from the human, the machine is said to have passed the test. The test does not check the ability to give correct answers to questions, only how closely answers resemble those a human would give.

The test was introduced by Turing in his paper, “Computing Machinery and Intelligence“, while working at the University of Manchester (Turing, 1950; p. 460).

Think of any recommender system, for example. With some effort, it is easy for a reasonably intelligent human to realise that the recommendations are being made by a machine. Even the most carefully designed recommender systems give away the fact that their intelligence is artificial once in a while.

To take a familiar example, people talk about the joy of discovering books in bookshops, and about the quality of recommendations given by an expert bookseller who gets his customers. Now, Amazon perhaps collects more data about its customers than any such bookseller, and uses them to recommend books. However, even a little scrolling reveals that the recommendations are rather mechanical and predictable.

It’s similar with my recommendations on Netflix – after a point you know the mechanics behind them.

In some sense this predictability is because the designers possibly think it’s a good thing – Netflix, for example, tells you why it has recommended a particular video. The designers of these algorithms possibly think that explaining their decisions might given their human customers more reason to trust them.

(As an aside, it is common for people to rant against the “opaque” algorithms that drive systems as diverse as Facebook’s News Feed and Uber’s Surge Pricing. So perhaps some algorithm designers do see reason in wanting to explain themselves).

The way I see it, though, by attempting to explain themselves these algorithms are giving themselves away, and willingly failing the Turing test. Whenever recommendations sound purely mechanical, there is reason for people to start trusting them less. And when equally mechanical reasons are given for these mechanical recommendations, the desire to trust the recommendations falls further.

If I were to design a recommendation system, I’d introduce some irrationality, some hard-to-determine randomness to try make the customer believe that there is actually a person behind them. I believe it is a necessary condition for recommendations to become truly personalised!