“Amit” is a word that is commonly used, often pejoratively, to refer to men from North India. The name is used this way because, while it is extremely common among North Indian men, it is not as common in other parts of India, and so it characterises men from North India.
A question that has been floating around in social media circles for a long time in this connection is what the feminine form of “Amit” is. If Amit characterises the median North Indian male, what name characterises the median North Indian female? Popular candidates for this are Neha, Isha and Pooja. Pooja suffers from the fact that it is also a fairly common name in other parts of India. Isha, while it might be strongly North Indian, is too obscure. And for some reason, people are loath to accept Neha as the feminine Amit. So how do we resolve this?
I, being a stud, am a big follower of the Hanuman principle. If you have to solve a problem, and it takes no more effort to solve a generic problem, then solve the generic problem and apply it to this problem as a special instance, rather than spending time solving each instance separately. Hence, we will rephrase this problem as “What first name uniquely identifies a particular ethnicity?” I, being a quant, am going to use the quantitative hammer to hammer down this nail. So we can rephrase as “How can we quantitatively characterise ethnicities by first names?”
The first thing to notice is that we need a frame of reference. Amit is a good name to characterise a North Indian man among the universe of Indian men. However, if we define the universe differently, as “Asian” for example, or “men living in Delhi”, Amit may not be characteristic at all. Hence, any formula that we develop needs to take into account the frame of reference.
Secondly, what makes a name ethnically characteristic? I argue that there are two factors, and these two will be used in deriving the final formula. Firstly, the name should be common among the particular ethnicity. For example, Murugaselvan is extremely characteristic of Tamil men, but it occurs so rarely that using Murugaselvan to represent the median Tamil man among all Indian men is futile. Secondly, the name should be distinctive of that particular community. For example, a possible competitor to Amit is Rahul, a name that is possibly as common among North Indians as Amit is (I haven’t seen the statistics). The problem with Rahul, however, is that it is a fairly common name in South India also! So it does a bad job in terms of discrimination. So basically what we are looking for is a name that is both popular in the ethnicity we want to characterise, and also characteristic of that particular ethnicity in comparison to the rest of the universe.
These two requirements lead to the following rather simple formula. (I’m not claiming that this is the best formula, if there is even a way to objectively evaluate such formulas, but it is sufficiently good and simple to understand and evaluate.) Let our universe be U and the community we are trying to characterise be C. C’ is {U - C}, that is, everyone in U who is not in C (I’m assuming all of you know set theoretic notation). The first name N that characterises the community C is the one that maximises P(N|C) - P(N|C’), where P(N|X) is the proportion of people in X who are named N. That’s it. Simple.
To explain in English, for each first name, we calculate the incidence of that particular name in the community C. That is, for example, what proportion of North Indian girls are named Neha, Pooja, Isha, Nidhi, etc. Next, we calculate the incidence of the name in the “complement of C”, that is how likely is it that someone in the rest of the “universe” we have defined has the same name. In our above example, we calculate what proportion of Indian but NOT North Indian girls (taking Indian women as the universe) are named Neha, Pooja, Isha, Nidhi, etc. Then, for each name, we subtract the latter quantity from the former quantity and then select the name for which this difference is maximum! Rather simple, I would think!
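For those who prefer code to English, here is a minimal Python sketch of the same calculation. The function names and the toy data are made up purely for illustration (I still don't have real data), so the output says nothing about the actual answer. It assumes the data comes as (name, group) pairs, with everything in the list forming the universe U.

```python
from collections import Counter


def name_scores(records, community):
    """Score every name by P(N|C) - P(N|C'), where C' is everyone in the universe outside C.

    `records` is a list of (first_name, group_label) pairs; the whole list is the universe U.
    """
    in_c = Counter(name for name, group in records if group == community)
    in_rest = Counter(name for name, group in records if group != community)
    n_c = sum(in_c.values())
    n_rest = sum(in_rest.values())
    if n_c == 0 or n_rest == 0:
        raise ValueError("both the community and its complement must be non-empty")
    # A Counter returns 0 for names it has never seen, so no special-casing is needed.
    names = set(in_c) | set(in_rest)
    return {name: in_c[name] / n_c - in_rest[name] / n_rest for name in names}


def characteristic_name(records, community):
    """Return the name with the largest difference in incidence."""
    scores = name_scores(records, community)
    return max(scores, key=scores.get)


# Entirely invented toy data: 8 "north" records and 8 records from the rest of the universe.
toy_records = (
    [("Neha", "north")] * 3
    + [("Pooja", "north")] * 3
    + [("Isha", "north"), ("Nidhi", "north")]
    + [("Pooja", "south")] * 2
    + [("Lakshmi", "south")] * 3
    + [("Divya", "south")] * 3
)

scores = name_scores(toy_records, "north")
winner = characteristic_name(toy_records, "north")
print(winner, scores[winner])
```

In this invented data, Pooja is just as common as Neha inside the community, but she loses out because she also shows up in the rest of the universe, which is exactly the Rahul problem described earlier. Changing the frame of reference simply means passing in a different list of records.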
Now, we need data. Unfortunately I can’t seem to find any publicly available data sets that contain long lists of names along with markers of ethnicity (address or city or state or language preference or some such). If you can help me with some data sets, we can actually run the above formula for different ethnicities and characterise them. It is going to be a fun exercise, I promise! So pour in the data. And I request you to share publicly available data and not proprietary data.
And then we can, once and for all, finish this debate of what the feminine form of Amit is, along with many other fun ethnic classifications.