Algorithms and the Turing Test

One massive concern about the rise of artificial intelligence and machine learning is the perpetuation of human biases. This could be racism (the story, possibly apocryphal, of a black person being tagged as a gorilla), sexism (see tweet below), or any other form of discrimination (objective-looking data that actually encodes certain social divisions).

In other words, the mainstream concern about artificial intelligence is that it is too human, and that such systems should somehow be “cured” of their human biases in order to be fair.

My concern, though, is the opposite: that many artificial intelligence and machine learning systems are not “human enough”. In other words, most present-day artificial intelligence and machine learning systems would not pass the Turing Test.

To remind you of the test, here is an extract from Wikipedia:

The Turing test, developed by Alan Turing in 1950, is a test of a machine’s ability to exhibit intelligent behavior equivalent to, or indistinguishable from, that of a human. Turing proposed that a human evaluator would judge natural language conversations between a human and a machine designed to generate human-like responses. The evaluator would be aware that one of the two partners in conversation is a machine, and all participants would be separated from one another. The conversation would be limited to a text-only channel such as a computer keyboard and screen so the result would not depend on the machine’s ability to render words as speech. If the evaluator cannot reliably tell the machine from the human, the machine is said to have passed the test. The test does not check the ability to give correct answers to questions, only how closely answers resemble those a human would give.

The test was introduced by Turing in his paper, “Computing Machinery and Intelligence“, while working at the University of Manchester (Turing, 1950; p. 460).

Think of any recommender system, for example. It is easy for a reasonably intelligent human to realise, with a little effort, that the recommendations are being made by a machine. Even the most carefully designed recommender systems give away the fact that their intelligence is artificial once in a while.

To take a familiar example, people talk about the joy of discovering books in bookshops, and about the quality of recommendations given by an expert bookseller who understands his customers. Now, Amazon perhaps collects more data about its customers than any such bookseller, and uses it to recommend books. However, even a little scrolling reveals that the recommendations are rather mechanical and predictable.

It’s similar with my recommendations on Netflix – after a point you know the mechanics behind them.

In some sense this predictability exists because the designers possibly think it’s a good thing – Netflix, for example, tells you why it has recommended a particular video. The designers of these algorithms possibly think that explaining their decisions might give their human customers more reason to trust them.

(As an aside, it is common for people to rant against the “opaque” algorithms that drive systems as diverse as Facebook’s News Feed and Uber’s Surge Pricing. So perhaps some algorithm designers do see reason in wanting to explain themselves).

The way I see it, though, by attempting to explain themselves these algorithms are giving themselves away, and willingly failing the Turing test. Whenever recommendations sound purely mechanical, there is reason for people to start trusting them less. And when equally mechanical reasons are given for these mechanical recommendations, the desire to trust the recommendations falls further.

If I were to design a recommendation system, I’d introduce some irrationality, some hard-to-determine randomness, to try to make the customer believe that there is actually a person behind the recommendations. I believe it is a necessary condition for recommendations to become truly personalised!
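As a rough sketch of what I mean (the item scores, the `k` and `epsilon` parameters, and the function itself are all my own made-up illustration, not any real system’s API), one could blend a few deliberately unpredictable picks into an otherwise score-ranked list:

```python
import random

def recommend(scored_items, k=5, epsilon=0.2, rng=random):
    """Return k recommendations: mostly the top-scored items, but
    with an epsilon fraction of hard-to-predict picks mixed in."""
    ranked = sorted(scored_items, key=lambda pair: pair[1], reverse=True)
    n_random = round(k * epsilon)
    # Take the top (k - n_random) items by score...
    top = [item for item, _ in ranked[: k - n_random]]
    # ...and fill the rest with random picks from further down the list.
    rest = [item for item, _ in ranked[k - n_random:]]
    picks = top + rng.sample(rest, min(n_random, len(rest)))
    rng.shuffle(picks)  # don't reveal which picks were the random ones
    return picks
```

Shuffling at the end matters: if the random picks always sat at the bottom of the list, the mechanics would give themselves away just as before.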

Recommendations and rating systems

This is something that came out of my IIMB class this morning. We were discussing building recommendation systems, using the whisky database (check related blog posts here and here). One of the techniques of recommendation we were discussing was “market basket analysis”, where you recommend products to people based on combinations of products that other people have been buying.

This is when one of the students popped up with the observation that market basket analysis done without “ratings” can be self-fulfilling! It was an extremely profound observation, so I made a mental note to blog about this. And I’ve told you earlier that this IIMB class that I’m teaching is good!

So the concept is this: if a lot of people have been buying A and B together, you start recommending B to buyers of A. Now suppose there are a number of people who buy A and C, but not B. Based on our analysis that people buy A and B together, we recommend B to them. Assume they take our recommendation and buy B – these people are now seen to have bought both B and C together.
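As a minimal sketch of the rule described above (the baskets and the `min_support` threshold are made up for illustration – real market basket analysis would also use support and confidence measures), the co-occurrence counting looks something like this:

```python
from collections import Counter
from itertools import combinations

def basket_rules(baskets, min_support=2):
    """Count how often pairs of products are bought together and derive
    'buyers of X also buy Y' rules for pairs seen at least min_support times."""
    pair_counts = Counter()
    for basket in baskets:
        # Count each unordered pair of distinct products in the basket.
        for x, y in combinations(sorted(set(basket)), 2):
            pair_counts[(x, y)] += 1
    rules = {}
    for (x, y), n in pair_counts.items():
        if n >= min_support:
            rules.setdefault(x, set()).add(y)
            rules.setdefault(y, set()).add(x)
    return rules

# A and B co-occur twice, so buyers of A get recommended B and vice versa;
# the one-off pairs (A, C) and (B, D) fall below the support threshold.
rules = basket_rules([{"A", "B"}, {"A", "B"}, {"A", "C"}, {"B", "D"}])
```

Note that nothing in this counting asks whether anyone *liked* what they bought – which is exactly the gap the student’s observation exposes.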

Now, if we don’t collect their feedback on B, we have no clue that they didn’t like it (assume that, for whatever reason, buyers of C don’t like B). But in the next iteration we see that buyers of C have been buying B, so we start recommending B to other C buyers. And so a bad idea (recommending B to buyers of C, thanks to A) can spiral and leave the credibility of our recommendation system in tatters.
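The spiral can be seen in a toy simulation (the products, basket counts, and co-occurrence threshold are all made up; the point is only the mechanism):

```python
def cooccurs(baskets, x, y, threshold=2):
    """True if x and y appear together in at least `threshold` baskets."""
    return sum(1 for b in baskets if x in b and y in b) >= threshold

# Three A+B buyers and three A+C buyers (who, by assumption, dislike B).
baskets = [{"A", "B"} for _ in range(3)] + [{"A", "C"} for _ in range(3)]

# Round 1: A and B co-occur often, so B gets recommended to A+C buyers.
assert cooccurs(baskets, "A", "B")
for b in baskets:
    if "A" in b and "B" not in b:
        b.add("B")  # recommendation accepted; no rating is ever collected

# Round 2: B and C now co-occur too, so B would be recommended to *all*
# C buyers - the bad recommendation has laundered itself into the data.
assert cooccurs(baskets, "B", "C")
```

Every accepted-but-unrated recommendation looks identical to an organic purchase, which is why the distortion compounds rather than washing out.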

Hence, it is useful to collect feedback (in the form of ratings) on items that we recommend to customers, so that these “recommended purchases” don’t end up distorting our larger data set!

Of course, what I’m saying here is not definitive and needs more work, but it is an interesting idea nevertheless and worth communicating. There are some easy workarounds – like excluding recommended products from the market basket analysis, or maintaining negative lists, and so on.

Nevertheless, I thought this is an interesting concept and hence worth sharing.

Practo and rating systems

The lack of a rating system means Practo is unlikely to take off like other similar platforms

So yesterday I found a dermatologist via Practo, a website that provides listing services for doctors in India. I visited him today and have been thoroughly disappointed with the quality of service (he subjected me to a random battery of blood tests – to be done in his own lab; and seemed more intent on cross-selling moisturising liquid soap rather than looking at the rash on my hand). Hoping to leave a bad review I went back to the Practo website but there seems to be no such mechanism.

This is not surprising, since doctors won’t want bad reviews about them to be public information. In the medical profession, reputational risk is massive, and if bad word gets around about you, your career is doomed. Thus even if Practo were to implement a rating system, doctors who got bad ratings (even the best doctors have off-days, and that can lead to nasty ratings) would want to delist from the service, since such ratings would do them much harm. This would in turn hurt Practo’s business (the more doctors listed, the more searches and appointments), which is why they don’t have a rating system.

The question is whether the lack of a rating system is going to hinder Practo’s growth as a platform. One of the reasons I would go to a website like Practo is that I don’t know any reliable doctors of the specialisation I’m looking for. Now, Practo puts out some “objective” statistics about every doctor on its website – their qualifications, years of experience and, for some, the number of people who clicked through (the doctor I went to today was a “most clicked” doctor, whatever that means) – but none of these is really correlated with quality.

And healthcare is a sector where, as Sangeet Paul Choudary of Platform Thinking puts it, “sampling costs are high”. To quote him:

There are scenarios where sampling costs can be so high as to discourage sampling. Healthcare, for example, has extremely high sampling costs. Going to the wrong doctor could cost you your life. In such cases, some form of expert or editorial discretion needs to add the first layer of input to a curation system.

So the lack of a rating system means that Practo will end up, at best, as a directory listing service rather than a recommendation service. Every time people find a “sub-optimal” doctor via Practo, their faith in the “platform” goes down and they become less likely to use it in the future for recommendation and curation. I expect Practo to settle asymptotically into being a software platform for doctors to manage their appointments – somewhere you go to request an appointment after you’ve already decided which doctor to visit!

Potential investors would do well to keep this in mind.

Update

Today I got an SMS from Practo asking if I was happy with my experience; I voted by giving a missed call to one of the two given numbers. I don’t know how they’ll use it, though. The page only shows how many upvotes each doctor got (for my search it was all in the low single digits), so it is again of little use to the user.