Bangalore names are getting shorter

The Bangalore Names Dataset, derived from the Bangalore Voter Rolls (cleaned version here), validates a hypothesis that a lot of people had – that given names in Bangalore are becoming shorter. From an average of 9 letters for a male aged around 80, the length of the given name comes down to 6.5 letters for a 20-year-old male.

What is interesting from the graph (click through for a larger version) is the difference in lengths of male and female names – notice the crossover around the age 25 or so. At some point in time, men’s names continue to become shorter while women’s names’ lengths stagnate.
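For anyone who wants to reproduce this kind of curve, here is a minimal sketch of the calculation. The rows below are made up for illustration (they are not taken from the voter rolls), and the function name and the ten-year age bands are my own assumptions rather than the method used for the graph.

```python
from collections import defaultdict

# Hypothetical sample rows: (given_name, gender, age). Not real voter-roll data.
voters = [
    ("Krishnappa", "M", 78),
    ("Lakshmamma", "F", 76),
    ("Karthik", "M", 22),
    ("Tejas", "M", 21),
    ("Divya", "F", 23),
    ("Pavithra", "F", 24),
]


def mean_name_length_by_band(rows, band_width=10):
    """Return {(gender, band_start): mean given-name length in letters}."""
    totals = defaultdict(lambda: [0, 0])  # key -> [sum of lengths, count]
    for name, gender, age in rows:
        band_start = (age // band_width) * band_width  # e.g. 78 -> 70
        bucket = totals[(gender, band_start)]
        bucket[0] += len(name)
        bucket[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


lengths = mean_name_length_by_band(voters)
for (gender, band_start), mean_len in sorted(lengths.items()):
    print(gender, band_start, round(mean_len, 2))
```

Plotting one line per gender against the age band would give a graph of the kind described above.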

So how are names becoming shorter? For one, honorific endings such as -appa, -amma, -anna, -aiah and -akka are becoming increasingly less common. Someone named “Krishnappa” (the most common name with the ‘appa’ suffix) in Bangalore is on average 56 years old, while someone named Krishna (the same name without the suffix) is on average only 44 years old. Similarly, the average age of people named Lakshmamma is 55, while that of everyone named Lakshmi (the same name without the suffix) is just 40.

In fact, if we look at the top 12 male and female names with an honorific ending, in each case the average age of people with the version without the ending is lower than that of people with the version with the ending. I’ve even graphed some of the distributions to illustrate this.

In each case, the red line shows the distribution of the longer version of the name, and the blue line the distribution of the shorter version.
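The comparison between the long and short forms can be sketched roughly as follows. The ages here are invented so that the averages match the figures quoted above; the pairing of long and short forms is written out by hand, which is my assumption about the simplest way to do it, since the short form is not always a plain prefix of the long one (Lakshmamma and Lakshmi, for instance).

```python
from statistics import mean

# Hypothetical ages per given name, chosen to reproduce the quoted averages.
# These are not real voter-roll records.
ages_by_name = {
    "Krishnappa": [60, 52, 56],
    "Krishna": [40, 48, 44],
    "Lakshmamma": [55, 57, 53],
    "Lakshmi": [38, 42, 40],
}

# (name with honorific ending, same name without it), listed by hand.
pairs = [("Krishnappa", "Krishna"), ("Lakshmamma", "Lakshmi")]


def age_gap(long_name, short_name, ages):
    """Mean age of the long form minus mean age of the short form."""
    return mean(ages[long_name]) - mean(ages[short_name])


gaps = {
    long_name: age_gap(long_name, short_name, ages_by_name)
    for long_name, short_name in pairs
}
for long_name, gap in gaps.items():
    print(long_name, "is older on average by", gap, "years")
```

A positive gap for every pair is what the claim about the top 12 names amounts to.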

In one of the posts yesterday, we looked at the most typical names by age in Bangalore. What happens when we flip the question? Can we define the “oldest” and “youngest” names, based on the average age of people who hold that name? In order to rule out fads, let’s stick to names that are held by at least 10000 people each.

These graphs are candidates for my own Bad Visualisations Tumblr, but I couldn’t think of a better way to represent the data. These graphs show the most popular male and female names, with the average age of a voter with that given name on the X axis, and the number of voters with that name on the Y axis. The information is all in the X axis – the Y axis is there just so that names don’t overlap.

So Karthik is among the youngest names among men, with an average age among voters being about 28 (remember this is not the average age of all Karthiks in Bangalore – those aged below 18 or otherwise not eligible to vote have been excluded). On the women’s side, Divya, Pavithra and Ramya are among the “youngest names”.

At the other end, you can see all the -appas and -ammas. The “oldest male name” is Krishnappa, with an average age of 56. And then you have Krishnamurthy and Narayana, which don’t have the -appa suffix but represent an old population anyway (the other -appa names just don’t clear the 10000 people cutoff).

More women’s names with the -amma suffix clear the 10000 people cutoff, and we can see that pretty much all women’s names with an average age of 50 and above have that suffix. And the “oldest female name”, subject to 10000 people having that name, is Muniyamma. And then you have Sarojamma and Jayamma and Lakshmamma. And a lot of other ammas.

What will be the oldest and youngest names if we relax the popularity cutoff, and instead look at names with at least 1000 people? The five youngest names are Dhanush, Prajwal, Harshitha, Tejas and Rakshitha, all with an average age (among voters) less than 24. The five oldest names are Papamma, Kannamma, Munivenkatappa, Seethamma and Ramaiah.
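The ranking with a popularity cutoff can be sketched like this. The rows are a toy sample I made up, and the cutoff of 2 stands in for the 10000 and 1000 used above; the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical (given_name, age) rows; not real voter-roll data.
rows = [
    ("Karthik", 25), ("Karthik", 31),
    ("Muniyamma", 62), ("Muniyamma", 58),
    ("Ramesh", 40), ("Ramesh", 44),
    ("Dhanush", 19),
]


def rank_names_by_mean_age(rows, min_count):
    """Return [(name, mean_age)] sorted youngest first,
    keeping only names held by at least min_count people."""
    ages = defaultdict(list)
    for name, age in rows:
        ages[name].append(age)
    averages = {
        name: sum(values) / len(values)
        for name, values in ages.items()
        if len(values) >= min_count
    }
    return sorted(averages.items(), key=lambda item: item[1])


strict = rank_names_by_mean_age(rows, min_count=2)   # stands in for the 10000 cutoff
relaxed = rank_names_by_mean_age(rows, min_count=1)  # stands in for the 1000 cutoff
print("youngest (strict):", strict[0], "oldest (strict):", strict[-1])
print("youngest (relaxed):", relaxed[0])
```

Relaxing the cutoff lets rarer names into the ranking, which is why the extremes move further out.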

This should give another indication of where names are headed in Bangalore!

Smashing the Law of Conservation of H

A decade and a half ago, Ravikiran Rao came up with what he called the “law of conservation of H”. The concept has to do with the South Indian practice of adding an “H” to denote a soft consonant, a practice not shared by North Indians (Karthik instead of Kartik for example). This practice, Ravikiran claims, is balanced by the “South Indian” practice of using “S” instead of “Sh”, because of which the number of Hs in a name is conserved.

Ravikiran writes:

The Law of conservation of H states that the total number of H’s in the universe will be conserved. So the extra H’s that are added when Southies have to write names like Sunitha and Savitha are taken from the words Sasi and Sri Sri Ravisankar, thus maintaining a balance in the language.

Using data from the Bangalore first names data set (warning: very large file), it is clear that this theory doesn’t hold water, in Bangalore at least. For what the data shows is that not only do Bangaloreans love the “th” and “dh” for the soft T and D, they also use “sh” to mean “sh” rather than use “s” instead.

The most commonly cited examples of LoCoH are Swetha/Shweta and Sruthi/Shruti. In both cases, the former is the supposed “South Indian” spelling (with th for the soft T, and S instead of sh), while the latter is the “North Indian” spelling. As it turns out, in Bangalore, both these combinations are rather unpopular. Instead, it seems like if Bangaloreans can add a H to their name, they do. This table shows the number of people in Bangalore with different spellings for Shwetha and Shruthi (now I’m using the dominant Bangalorean spellings).

As you can see, Shwetha and Shruthi are miles ahead of any of the alternate ways in which the names can be spelt. And this heavy usage of H can be attributed to the way Kannada incorporates both Sanskrit and Dravidian history.

Kannada has a pretty large vocabulary of consonants. Every consonant has both the aspirated and unaspirated version, and voiced and unvoiced. There are three different S sounds (compared to Tamil which has none) and two Ls. And we need a way to transliterate each of them when writing in English. And while capitalising letters in the middle of a word (as per Harvard Kyoto convention) is not common practice, standard transliteration tries to differentiate as much as possible.

And so, since aspirated Tha and Dha aren’t that common in Kannada (except in the “Tha-Tha” symbols used by non-Kannadigas to show raised eyes), th and dh are used for the dental letters. And since Sh exists (and in two forms), there is no reason to substitute it with S (unlike Tamil). And so we have H everywhere.

Now, lest you were to think that I’m using just two names (Shwetha and Shruthi) to make my point, I dug through the names dataset to see how often names with interchangeable T and Th, and names with interchangeable S and Sh, appear in the Bangalore dataset. Here is a sample of both:

There are 13002 Karthiks registered to vote in Bangalore, but only 213 Kartiks. There are a hundred times as many Lathas as Latas. Shobha is far more common than Sobha, and Chandrashekhar much more common than Chandrasekhar.
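A count of this sort can be sketched as below. Only the Karthik and Kartik counts are the ones quoted above; the Latha and Lata counts are placeholders I made up to match the "hundred times" ratio, and the helper names are hypothetical.

```python
# Voter counts per spelling. Karthik/Kartik are the figures quoted in the post;
# Latha/Lata are made-up placeholders, not real counts.
counts = {"Karthik": 13002, "Kartik": 213, "Latha": 5000, "Lata": 50}


def without_h(name, digraph):
    """Drop the 'h' from every occurrence of a digraph such as 'th' or 'sh',
    including a capitalised one at the start of the name."""
    plain = digraph[0]
    name = name.replace(digraph, plain)
    return name.replace(digraph.capitalize(), plain.upper())


def h_ratio(name, digraph, counts):
    """How many times more common the spelling with H is than the one without."""
    return counts[name] / counts[without_h(name, digraph)]


print(without_h("Shobha", "sh"), without_h("Chandrashekhar", "sh"))
print(round(h_ratio("Karthik", "th", counts), 1))
```

A ratio well above 1 for most names is what "adding an H wherever possible" looks like in the data.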

 

So while other South Indians might conserve H, by not using it with S to compensate for using it with T and D, this doesn’t apply to Bangalore. Thinking about it, I wonder how a Kannadiga (Ravikiran) came up with this theory. Perhaps the fact that he has never lived in Karnataka explains it.

The popularity of nicknames and political correctness

It is a rite of passage in an institution such as IIT (Indian Institute of Technology) that a first year student be given a potentially embarrassing nickname following “interaction” with senior students. The profundity of these nicknames varies significantly, with some people simply being given names that correspond to body parts in different languages, while others have more involved names.

Based on a conversation yesterday, the hypothesis is that more profound nicknames which are embarrassing only in a particular context are more likely to propagate, and thus stick, while the more crass names are likely to die out more easily.

The logic is simple – the crass names (a few examples being “lund”, “condom” and “dildo” – there is at least one person with each of these names in every hostel of every batch at IIT Madras) are potentially embarrassing for an “outsider” to use, and to be used in public. So when the bearer of such a name graduates and moves on to a new setting, the new people he encounters make a prudent choice to not use the embarrassing word, and the nickname dies a quick death.

When the nickname is embarrassing or derogatory for more contextual reasons, though, the name quickly loses its context and becomes incredibly simple for people to use. Take my own name “Wimpy”, for example – not too many people know it has an embarrassing origin, and it is a perfectly respectable word to shout out in public, or even in an office setting. And so it has propagated – in at least two offices I worked in, everyone called me “Wimpy”.

It is similar for lots of other “benign” names. But it is unlikely that a name like “condom” or “dildo” will propagate, and it is in fact more likely that even the people who bestowed such names upon the unsuspecting will stop using them once everyone graduates and moves on to a more formal environment.

There are exceptions, of course, a notable one being “Baada“. It is a cuss-word representing a body part, except that it is in a non-standard (though not small by any means) language, but everyone I know calls Baada Baada. He used to be my colleague, and people at work also called him Baada. It is unlikely that his nickname would’ve propagated, though, had it been the synonym in English or Hindi.

Thanks to Katpadi Katsa for discussions leading up to this post. In a future post, I’ll talk about models for propagation of nicknames across institutions.

 

 

Diversity and sorting by last name

So the wife graduated today. The graduation ceremony was in threes – three graduates were called at a time and presented their degrees (the wife now claims that she has one more degree than me, since my B-school gave me a Post Graduate Diploma and not a Masters).

It was reminiscent of the swearing in of Ministers of State in India, who take oath four at a time. My graduation ceremonies, where we collected our degrees one at a time, were more like the swearing in of Cabinet Ministers. This simultaneous award of degrees worked well in finishing the ceremony in good time, though.

As is usual in such ceremonies, the graduates had been sorted by name. Except that since this is a global business school, the sorting was done by <Last Name> followed by <First Name> (at all my schools, sorting has been in the opposite order).

This resulted in a fairly hilarious bunching of graduates from different countries at the same point in time. One batch of three was a set of three Lees, for example (rather amazingly, there was not a single Wang in the graduating class). They were followed by two more Lees/Lis. Another set of three were three Japanese who had the same prefix to the last name.

And the wife was one of three Indians in the batch whose last name started with “Bha-“. It’s a rather unique Indian construct, and the three were listed consecutively for graduation. It was only because of a “cut” that occurred in the middle that the three didn’t go simultaneously to receive their degrees.

Different countries have different name forms and the same words might occur as a prefix of a large number of last names from the country. Such prefixes might also be unique to certain countries, thanks to which sorting by last name results in the occurrence of several “country clusters” through the course of the list.

It got me wondering if the diversity of the batch (more than 50 countries were represented in the graduating class of ~300) might have been exhibited better, and people of the same nationality been spread apart more widely through the list, had they done (what is to us Indians) the conventional thing and sorted by first name instead!
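One rough way to put a number on this "country clustering" is to count how many neighbouring pairs in the sorted list come from the same country. The sketch below does this for a made-up list of six graduates (the names and countries are invented, not the actual class), and the adjacency count is just my assumption for a reasonable measure.

```python
# Hypothetical graduates: (first_name, last_name, country). Invented for illustration.
graduates = [
    ("Anita", "Bhat", "India"),
    ("Jae", "Lee", "Korea"),
    ("Ravi", "Bhalla", "India"),
    ("Soo", "Lee", "Korea"),
    ("Carlos", "Garcia", "Spain"),
    ("Sunil", "Bhatia", "India"),
]


def same_country_neighbours(ordered):
    """Count adjacent pairs in the list that share a country."""
    return sum(
        1 for left, right in zip(ordered, ordered[1:]) if left[2] == right[2]
    )


by_last_name = sorted(graduates, key=lambda g: (g[1], g[0]))
by_first_name = sorted(graduates, key=lambda g: (g[0], g[1]))

clusters_last = same_country_neighbours(by_last_name)
clusters_first = same_country_neighbours(by_first_name)
print("sorted by last name:", clusters_last)
print("sorted by first name:", clusters_first)
```

Whether first-name sorting really spreads nationalities apart in a class of ~300 would need the actual list; this toy example only shows how the comparison could be made.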

What is the feminine of Amit?

“Amit” is a word that is commonly used, often pejoratively, to refer to men from the North of India. The reason for the usage of “Amit” in this context is that while it is an extremely common name for men from North India, it is not as common in other parts of India, and thus it characterises men from North India.

A question that has been floating around in social media circles for a long time in this connection is what the feminine form of “Amit” is. If Amit characterises the median North Indian male, what name characterises the median North Indian female? Popular candidates for this are Neha, Isha and Pooja. Pooja suffers from the fact that it is also a fairly common name in other parts of India. Isha, while it might be strongly North Indian, is too obscure. And for some reason, people are loath to accept Neha as the feminine Amit. So how do we resolve this?

I, being a stud, am a big follower of the Hanuman principle. If you have to solve a problem, and it takes no more effort to solve a generic problem, then solve the generic problem and apply it to this problem as a special instance rather than spending time to solve each instance. Hence, we will rephrase this problem as “What first name uniquely identifies a particular ethnicity?”. I, being a quant, am going to use the quantitative hammer to hammer down this nail. So we can rephrase as “how can we quantitatively characterise ethnicities by first names?”

The first thing to notice is that we need a frame of reference. Amit is a good name to characterise a North Indian man among the universe of Indian men. However, if we define the universe differently, as “Asian” for example, or “men living in Delhi”, Amit may not be as characteristic at all. Hence, any formula that we develop needs to take into account the frame of reference.

Secondly, what makes a name ethnically characteristic? I argue that there are two factors, and these two will be used in deriving the final formula. Firstly, the name should be common among the particular ethnicity – for example, Murugaselvan is extremely characteristic of Tamil men, but its occurrence is so low that using Murugaselvan as the median Tamil man among all Indian men is futile. Secondly, the name should be distinctive for that particular community. For example, a possible competitor to Amit is Rahul, a name that is possibly as common among North Indians as Amit is (I haven’t seen the statistics). The problem with Rahul, however, is that it is a fairly common name in South India also! So it does a bad job in terms of discrimination. So basically what we are looking for is a name that is both popular in the ethnicity we want to characterise, and also characteristic to that particular ethnicity in comparison to the universe.

These two requirements lead to the following rather simple formula (I’m not claiming that this is the best formula – if there is a way to objectively evaluate such formulas, that is – but it is sufficiently good and simple to understand and evaluate). Let our universe be U and the community we are trying to characterise be C. C’ is {U – C} (I’m assuming all of you know set theoretic notation). The first name N that characterises the community C is the one that maximises P(N|C) – P(N|C’). That’s it. Simple.

To explain in English, for each first name, we calculate the incidence of that particular name in the community C. That is, for example, what proportion of North Indian girls are named Neha, Pooja, Isha, Nidhi, etc. Next, we calculate the incidence of the name in the “complement of C”, that is how likely is it that someone in the rest of the “universe” we have defined has the same name. In our above example, we calculate what proportion of Indian but NOT North Indian girls (taking Indian women as the universe) are named Neha, Pooja, Isha, Nidhi, etc. Then, for each name, we subtract the latter quantity from the former quantity and then select the name for which this difference is maximum! Rather simple, I would think!
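Here is a minimal sketch of the formula in code. The list of people is entirely made up (I don’t have the data yet, as explained below), so the winner it prints means nothing; the function name and the "North"/"South" labels are assumptions for illustration.

```python
from collections import Counter

# Hypothetical (first_name, community) records. Invented numbers, not real data.
people = (
    [("Neha", "North")] * 4
    + [("Pooja", "North")] * 4
    + [("Isha", "North")] * 2
    + [("Pooja", "South")] * 3
    + [("Divya", "South")] * 6
    + [("Neha", "South")] * 1
)


def most_characteristic_name(people, community):
    """Return (name, score) maximising P(N|C) - P(N|C').

    `people` must be a list (it is scanned twice), and both the community
    and its complement are assumed to be non-empty.
    """
    inside = Counter(name for name, group in people if group == community)
    outside = Counter(name for name, group in people if group != community)
    n_inside, n_outside = sum(inside.values()), sum(outside.values())
    # Counter returns 0 for names that never occur outside the community.
    scores = {
        name: inside[name] / n_inside - outside[name] / n_outside
        for name in inside
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


best_name, best_score = most_characteristic_name(people, "North")
print(best_name, round(best_score, 2))
```

In this toy data Pooja is as common as Neha inside the community but loses because she is also common outside it, which is exactly the Rahul problem described earlier.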

Now, we need data. Unfortunately I can’t seem to find any publicly available data sets that contain long lists of names along with markers of ethnicity (address or city or state or language preference or some such). If you can help me with some data sets, we can actually run the above formula for different ethnicities and characterise them. It is going to be a fun exercise, I promise! So pour in the data. And I request you to share publicly available data and not proprietary data.

And then we can once and for all finish this debate of what the feminine form of Amit is, along with many other fun ethnic classifications.