Different people use the same rating scale in different ways. Hence, nuance is required while aggregating ratings and taking decisions based on them.
During the recent Times Lit Fest in Bangalore, I was talking to some acquaintances regarding the recent Uber rape case (where a car driver hired through the Uber app in Delhi allegedly raped a woman). We were talking about what Uber can potentially do to prevent bad behaviour from drivers (which results in loss of reputation, and consequently business, for Uber), when one of them mentioned that the driver accused of rape had an earlier complaint against him within the Uber system, but because the complainant in that case had given him “three stars”, Uber had not pulled him up.
Now, Uber has a system of rating both drivers and passengers after each ride – you are prompted to give the rating as soon as the ride is done, and you are unable to proceed to your next booking unless you’ve rated the previous ride. This removes selection bias from the ratings: in most systems, you typically leave a rating only when the product/service has been exceptionally good or bad, which skews the ratings. Because Uber prompts every user after every ride, there is no opportunity for such bias, and ratings are usually fair.
Except for one problem – different people have different norms for rating. For example, I believe that there is nothing “exceptional” that an Uber driver can do for me, and hence my default rating for all “satisfactory” rides is a 5, with lower scores being used progressively for different levels of infractions. For another user, for example, the default might be 1, with 2 to 5 being used for various levels of good service. Yet another user might use only half the provided scale, with 3 being “pathetic”, for example. I once worked for a firm where annual employee ratings came out on a similar five-point scale. Over the years so much “rating inflation” had happened that back when I worked there anything marginally lower than 4 on 5 was enough to get you sacked.
What this means is that arithmetically averaging ratings across raters, and devising policies based on particular rating levels, is clearly wrong. For example, when in the earlier case (as mentioned by my acquaintance) a user rated the offending driver a 3, Uber should not have looked at the rating in isolation, but in relation to other ratings given by that particular user (assuming she had used the service before).
It is a similar case with any other rating system – a rating looked at in isolation tells you nothing. What you need to do is to look at it in relation to other ratings by the user. It is also not enough to look at a rating in relation to just the “average” rating given by a user – variance also matters. Consider, for example, two users. Ramu uses 3 for average service, 4 for exceptional and 2 for pathetic. Shamu also uses 3 for average, but he instead uses the “full scale”, using 5 for exceptional service and 1 for pathetic. Now, if a particular product/service is rated 1 by both Ramu and Shamu, it means different things – in Shamu’s case it is “simply pathetic”, for that is both the lowest score he has given in the past and the lowest he can give. In Ramu’s case, on the other hand, a rating of 1 can only be described as “exceptionally pathetic”, for his variance is low and hence he almost never rates someone below 2!
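One common way to account for both a rater's average and their variance is to convert each rating into a standard score relative to that rater's own history. The sketch below is only an illustration of that idea, not something Uber is known to do: the function name `standardize` and the two rating histories are made up for the example, chosen to match the Ramu and Shamu description above.

```python
from statistics import mean, pstdev


def standardize(rating, history):
    """Return how many standard deviations `rating` lies from the
    rater's own mean, or None if the history cannot support that."""
    if len(history) < 2:
        return None  # too little history to estimate this rater's norm
    mu = mean(history)
    sigma = pstdev(history)  # population standard deviation of past ratings
    if sigma == 0:
        return None  # rater has always given the same score
    return (rating - mu) / sigma


# Hypothetical past ratings (assumed for illustration only).
ramu_history = [2, 3, 3, 3, 3, 4]   # narrow scale: 2 to 4
shamu_history = [1, 2, 3, 3, 4, 5]  # full scale: 1 to 5

# Both now give the same raw rating of 1.
ramu_score = standardize(1, ramu_history)
shamu_score = standardize(1, shamu_history)

print(f"Ramu:  {ramu_score:.2f}")
print(f"Shamu: {shamu_score:.2f}")
```

With these assumed histories, both raters average 3, but Ramu's 1 comes out at roughly 3.5 standard deviations below his norm while Shamu's is only about 1.5 below his. A policy keyed to the standardized score rather than the raw rating would treat Ramu's 1 as the stronger signal.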
Thus, while a rating system is a necessity in ensuring good service in a two-sided market, it needs to be designed and implemented in a careful manner. Lack of nuance in designing a rating system can result in undermining the system and rendering it effectively useless!