Breaking up sentences in the absence of punctuation

From time to time, a joke goes around that makes the value of punctuation clear. Check out this picture, for example.

Recently, I saw this on my Twitter timeline (though here it’s an issue of spacing rather than punctuation).

Someone actually wrote an entire book about the value of punctuation.

In any case, I have a pretty bad track record in terms of reading sentences that don’t have punctuation. I can think of two examples right away.

Firstly, my school diary was filled with quotes from Sri Aurobindo and “The Mother”. By turns, each of us would have to say a “thought for the day” in the school assembly. And we had a reliable way of finding such thoughts – just look in the diary and spout out the nuggets.

One of those went:

Always do what you know to be the best even if it is the most difficult thing to do.

Yes, I remember that. Because I had spouted this not once but twice. Now, this is a long sentence without any punctuation. How would you read it?

For the longest time I read it like this.

Always do what you know, to be the best, even if it is the most difficult thing to do. 

So if there are many things that you can do, and you know one of the things, you do that thing even if it is harder than everything else that you might do (but don’t know how to do).

Clearly that doesn’t make that much sense. It was only when I was about to graduate that I figured that it was actually:

Always do what you know to be the best even if it is the most difficult thing to do. 

So there are many things you can do. You know one of them is the “best thing to do”. So even if it is the most difficult thing to do, you do it because you know it is “the best” (not because you “know it the best”).

Another example is from this store near my house. I don’t think the store existed for too long, but it had an interesting and quirky name (in Kannada).

“ellAdEvarakrupe stores”, it said. For the longest time I read it as “ellAdEvara krupe”, or “grace of all gods”. And I thought it was a fascinating name in terms of recognising all religions. And I’ve quoted it many times.

When I was quoting this on Twitter earlier today, I realised that I had got the name of the store all wrong. It’s “ellA dEvarakrupe”, meaning “everything is god’s grace”. It says nothing about which gods are included or excluded, or how many gods there are.

What are your favourite examples of sentences that you’ve misread thanks to the lack of punctuation or other visible sentence markers?


Paiyas and Kodakas

Growing up, I found that a lot of my non-Kannadiga friends took great pleasure in using the words “maga” and “magane” (both mean “son”). For a long time I didn’t understand what was so pleasurable in calling someone “son”. Wasn’t it normal in other languages as well (though Tamil prefers “Macha”, meaning brother-in-law)?

It took two incidents, separated by six years (and the latter of the two happened ten years ago), for me to understand this. It had to do with abuses.

I remember visiting a Tamilian friend at her home sometime in 2004. There were a few other friends there, and everyone who was there except me was Tamilian (and this is a 20-year-old problem – people randomly assume I’m Tamilian and speak to me in Tamil). So the host’s mother, in the course of the conversation, would break off into Tamil, and when the discussion was about some boys, would talk about “this paiyya” or “that paiyya”.

I remember trying to suppress a chuckle every time she said “paiyya” (I’ll come to the reason in a bit), but largely managed to keep a straight face through the conversation.

Six years later I was visiting my then-girlfriend, now-wife. Pinky’s mother is Gult (technically her father is also Gult, but his ancestors came to Karnataka so long ago that for all practical purposes they’re dig). On the day I visited, Pinky’s aunt was also visiting, and Pinky’s mother and aunt were talking (in Gult) about some boys. And they kept referring to these boys as “koDaku”.

Again I had to suppress chuckles, for the same reason I had suppressed chuckles when my friend’s mother kept saying “paiyya” six years before. And at the same time I understood why my non-Kannadiga friends took such pleasure in saying “magane”. It has to do with abuses.

When you learn a new language as a teenager, it is fairly standard to start off by first learning the swearwords in that language. For some strange reason, South Indians revel in abusing one another’s mothers. And so the popular abuses in all South Indian languages follow this template.

In Kannada, you have “bOLi magane” (son of a bitch) and “sULe magane” (son of a prostitute). Tamil has “thEvaDiya paiyya” (son of a prostitute again). Telugu has “lanja koDaka” (son of a prostitute, once again) and, rather fascinatingly for the amateur anthropologist, “donganA koDaka” (son of a thief).

And in Telugu and Tamil, the word for “boy” is also used interchangeably for “son”, and it’s the same word that appears in the above swear-phrases (Kannada is a little bit different – the word for “boy” is used for “son”, but the swearwords all have the word that is exclusively used for “son”).

Now you know where this is going.

In normal teenage or college conversation it’s not common to talk about people’s sons. So if you’re a Kannadiga who’s only learnt swearwords in Telugu or Tamil, you would have heard the words “koDaka” and “paiyya” in only that context. You would have never heard these words in isolation in normal conversation, separated from the prefixes that give them their swearing qualities.

So because “thEvaDiya paiyya” is a swearphrase, I had assumed that both words in it are independently swearwords. And so I was shocked that my friend’s mother kept casually saying “paiyya” in the course of normal conversation, and my (extremely paavam/sadhu) friends didn’t flinch.

It is the same with “koDaka” – having appeared in TWO swearphrases I knew, I assumed it was a swearword, and was shocked to see my would-be mother-in-law use it in a casual conversation with her sister.

I imagine it is the same with “magane” – for non-Kannadigas for whom it’s just part of a swearphrase, it is effectively a swearword. And so, when they use the word, it’s as if they are swearing. And that explains their glee in uttering the word.

Kannada has another son-based swearword: “baDDi maga”, which translates to “son of interest” (as in the interest you pay on a loan). I’ve never understood the logic behind that one.

Default Acronym Expansions

Based on the kind of stuff we are interested in, each of us has our own “default expansions” for acronyms.

Now, there are only 26 letters in the English alphabet (and some are much more common than others), and a good acronym is 2-4 letters long, so there are only so many acronyms to go around. So it is inevitable that there is acronym overloading, with the same acronym meaning different things in different contexts.
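To put a rough number on that, here is a quick back-of-the-envelope count in Python. It assumes every combination of letters is usable, which overstates things, since some letters are far more common than others.

```python
# Count the possible acronyms of each length over a 26-letter alphabet.
# Assumption: every letter combination is equally usable as an acronym.
ALPHABET_SIZE = 26

counts = {length: ALPHABET_SIZE ** length for length in range(2, 5)}
total = sum(counts.values())

for length, count in counts.items():
    print(f"{length}-letter acronyms: {count}")
print(f"total: {total}")  # 676 + 17576 + 456976 = 475228
```

Fewer than half a million slots, most of them in the less catchy four-letter range, have to be shared by every company, product, sports team and financial instrument.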

In this context, whenever we see an acronym, we have a default expansion of it based on our interests and domains and exposures. And this can lead to some hilarious interpretations at times.

I read this newsletter called “Margins”. I don’t agree with everything they write, but they write about interesting stuff so I read them. Yesterday’s edition had this gem:

Clearly, the 2008 Financial Crisis and the blowup of CDOs and MBSs left a bad taste in people’s mouths over the chopping up and passing off of debt (note: I now get uncomfortable every time I write “MBS” and “chopped up” in a sentence).

This joke works only because of acronym overloading. MBS also refers to Saudi crown prince Mohammed Bin Salman, and he “allegedly” got dissident journalist Jamal Khashoggi, who worked for the Washington Post, literally chopped up (for those of you for whom Mohammed Bin Salman is the default MBS, it can also refer to Mortgage Backed Securities).

Long ago, I worked for a company that had launched a product acronymised as “LFM”. I could never understand what this product does because my “default expansion” of LFM is Left Arm Fast Medium.

Acronym confusion can also happen when you’re deeply familiar with one domain with its own set of jargon and acronyms, and then are suddenly exposed to another domain that has its own set of jargon and acronyms. It takes a long time to “unlearn” your old acronyms and then learn the new ones.

Then again, given the limited number of acronyms available, sooner or later we had better learn to learn and unlearn meanings of acronyms.

Maybe one day Kohlberg Kravis Roberts will buy Kolkata Knight Riders
I still don’t understand how the IPL allowed Delhi Capitals since there used to exist a team called Deccan Chargers in the same league
Does your All India Rank get announced on All India Radio?

Giant Squid is Good Stuff.


The difficulty of song translation

One of my wife’s favourite nursery rhymes is this song that is sung to the tune of “for he’s a jolly good fellow”, and about a bear going up a mountain.

For a long time I only knew of the Kannada version of this song (which is what the wife used to sing), but a year or two back, I found the “original” English version as well.

And that was a revelation, for the lyrics in the English version make a lot more sense. They go:

The bear went over the mountain;
The bear went over the mountain
The bear went over the mountain, to see what he could see.
And all that he could see, and all that he could see
Was the other side of the mountain,
The other side of the mountain
The other side of the mountain, was all that he could see.

Now, the Kannada version, sung to the same tune, obviously goes “karaDi beTTakke hoithu”. That part has been well translated. However, the entire stanza hasn’t been translated properly, because of which the song becomes a bit meaningless.

The lyrics, when compared to the original English version, are rather tame. Since a large part of my readership doesn’t understand Kannada, here is my translation of the lyrics (btw, the lyrics used in these YouTube versions are different from the lyrics that my wife sings, but both are similar):

The bear went to the mountain.
The bear went to the mountain.
The bear went to the mountain.
To see the scenery

And what did it see?
What did it see?
The other side of the mountain.
The other side of the mountain.
It saw the scenery of the other side of the mountain.

Now, notice the important difference in the two versions, which massively changes the nature of the song. The Kannada version simply skips the “all that he could see” part, which I think is critical to the story.

The English version, in a way, makes fun of the bear, talking about how it went over the mountain thinking it’s a massive task, but “all that he could see” from there was merely the other side of the mountain. This particular element is missing in Kannada – there is nothing in the lyrics that suggests that the bear’s effort to climb the mountain was a bit of a damp squib.

And that, I think, is due to the difficulty of translating songs. When you translate a song, you need to preserve both the letter and the spirit of the lyrics, while making sure they can follow the already-set music as well (and even get the rhyming right). And unless highly skilled bilingual poets are involved, this kind of translation is really difficult.

So you get half-baked translations, like the bear story, which possibly captures the content of the story but completely ignores its spirit.

Ever since I listened to the original English version, I’ve stopped listening to the Kannada version of the bear-mountain song. Except when the wife sings it, of course.


Gults and Grammar

Back in IIT, it was common to make fun of people from Andhra Pradesh for their poor command over the English language. It was a consequence of the fact that JEE coaching is far more institutionalised in that (undivided) state, because of which people come to IIT from less privileged backgrounds (on average) than their counterparts in Karnataka or Tamil Nadu or Maharashtra.

Now, in hindsight, making fun of people’s English doesn’t sound particularly nice, but sometimes stories come up that make it incredibly hard to resist.

This one is from Matt Levine’s newsletter. And it is about an insider trading ring. This is a quote that Levine has quoted in his newsletter (pay attention to the names):

According to the SEC’s complaint, Janardhan Nellore, a former IT administrator then at Palo Alto Networks Inc., was at the center of the trading ring, using his IT credentials and work contacts to obtain highly confidential information about his employer’s quarterly earnings and financial performance. As alleged in the complaint, until he was terminated earlier this year, Nellore traded Palo Alto Networks securities based on the confidential information or tipped his friends, Sivannarayana Barama, Ganapathi Kunadharaju, Saber Hussain, and Prasad Malempati, who also traded.

The SEC’s complaint alleges that the defendants sought to evade detection, with Nellore insisting that the ring use the code word “baby” in texts and emails to refer to his employer’s stock, and advising they “exit baby,” or “enter few baby.” The complaint also alleges that certain traders kicked back trading profits to Nellore in small cash transactions intended to avoid bank scrutiny and reporting requirements. After the FBI interviewed Nellore about the trading in May, he purchased one-way tickets to India for himself and his family and was arrested at the airport.

You can look at Levine’s newsletter to understand his take on the story (it’s towards the bottom), but what catches my eye is the grammar. I think it is all fine to refer to the insider-traded stock as a “baby”, but at least be grammatically correct about it!

“Enter few baby” is so obviously grammatically incorrect (it’s hard to even be a typo) that when intercepted by someone like the SEC, it would immediately send alarm bells ringing. Which is what I suppose possibly happened.

So my take on this case is – don’t insider trade, but even if you do, be grammatical about your signals. If you’re so obviously grammatically wrong, it is easy for whoever intercepts your chats to know you’re up to something fishy.

But then if you’re gult..

YG Rao

We’re celebrating Ganesha Chaturthi by re-watching Ganeshana Maduve and Gowri Ganesha, two classic movies from the early 1990s starring Anant Nag and Vinaya Prasad.

Ganeshana Maduve is a shop-around-the-corner / you’ve-got-mail kind of story of real-life neighbours who hate each other but court each other through letters. Real-life Adilakshmi has adopted the name “Shruti” for her singing career, and she replies to her fan-mail under the same name.

It is her fan/neighbour’s name that had intrigued me thus far. He is the titular Ganesha, but saying that “Ganesha” sounds too old-fashioned, he writes his letters under the name “Y G Rao”, short for his full name which is “Y Ganesh Rao”.

Now, this would have been fine, except that later on in the movie his father’s name is shown to be Govinda. And under Kannada naming conventions, this simply doesn’t make sense. Typically, in most Kannada names, if you have only one “initial”, it represents your father’s given name (for example, the S in my name stands for Shashidhar, which is my father’s given name).

Hence, under standard Kannada naming conventions, Govinda’s son has to be G Ganesh Rao. And in what is an overall excellent movie (it’s easily my most-watched movie of all time. Today was perhaps the 50th time I watched it), this naming convention was a bit intriguing.

The thing with Ganeshana Maduve is that each time you watch it, you discover a layer that you hadn’t discovered (or had missed) earlier. And one detail I found today that I’d missed earlier is that the movie is based on a Telugu novel. And then it all started making sense.

It is perfectly okay under Telugu naming convention for Govinda’s son to be Y Ganesh Rao, for a single initial there represents the family name, rather than the father’s given name.

And so it is very likely that when the Telugu novel was adapted into a Kannada film, the names were kept the same, and so we got the Telugu convention into the Kannada movie!

The next item on today’s festival agenda was to watch Gowri Ganesha, but I need to get some work done, so the wife is watching that alone. And while some process runs I’m writing this post.

Good vodka and bad chicken

When I studied Artificial Intelligence, back in 2002, neural networks weren’t a thing. The limited compute capacity and storage available at that point in time meant that most artificial intelligence consisted of what is called “rule based methods”.

And as part of the course we learnt about machine translation, and the difficulty of getting the implicit meaning across. The favourite example among computer scientists at that time was the story of how some scientists translated “the spirit is willing but the flesh is weak” into Russian using English-Russian translation software, and then converted it back into English using Russian-English translation software.

The result was “the vodka is excellent but the chicken is not good”.

While this joke may not be valid any more thanks to the advances in machine translation, aided by big data and neural networks, the issue of translation is useful in other contexts.

Firstly, speaking in a language that is not your “technical first language” makes you eschew jargon. If you have been struggling to get rid of jargon from your professional vocabulary, one way to get around it is to speak more in your native language (which, if you’re Indian, is unlikely to be your technical first language). Devoid of the idioms and acronyms that you normally fill your official conversation with, you are forced to think, and this practice of talking technical stuff in a non-usual language will help you cut your jargon.

There is another use case for using non-standard languages – dealing with extremely verbose prose. A number of commentators, a large number of whom are rather well-reputed, have this habit of filling their columns with flowery language, GRE words, repetition and rhetoric. While there is usually some useful content in these columns, it gets lost in the language and idioms and other things that would make the columnist’s high school English teacher happy.

I suggest that these columns be given the spirit-flesh treatment. Translate them into a non-English language, get rid of redundancies in sentences and then translate them back into English. This process, if the translators are good at producing simple language, will remove the bluster and make the column much more readable.

Speaking in a non-standard language can also make you get out of your comfort zone and think harder. Earlier this week, I spent two hours recording a podcast in Hindi on cricket analytics. My Hindi is so bad that I usually think in Kannada or English and then translate the sentence “live” in my head. And as you can hear, I sometimes struggle for words. Anyway here is the thing. Listen to this if you can bear to hear my Hindi for over an hour.

Smashing the Law of Conservation of H

A decade and a half ago, Ravikiran Rao came up with what he called the “law of conservation of H”. The concept has to do with the South Indian practice of adding an “H” to denote a soft consonant, a practice not shared by North Indians (Karthik instead of Kartik for example). This practice, Ravikiran claims, is balanced by the “South Indian” practice of using “S” instead of “Sh”, because of which the number of Hs in a name is conserved.

Ravikiran writes:

The Law of conservation of H states that the total number of H’s in the universe will be conserved. So the extra H’s that are added when Southies have to write names like Sunitha and Savitha are taken from the words Sasi and Sri Sri Ravisankar, thus maintaining a balance in the language.

Using data from the Bangalore first names data set (warning: very large file), it is clear that this theory doesn’t hold water, in Bangalore at least. For what the data shows is that not only do Bangaloreans love the “th” and “dh” for the soft T and D, they also use “sh” to mean “sh” rather than use “s” instead.

The most commonly cited examples of LoCoH are Swetha/Shweta and Sruthi/Shruti. In both cases, the former is the supposed “South Indian” spelling (with th for the soft T, and S instead of sh), while the latter is the “North Indian” spelling. As it turns out, in Bangalore, both these combinations are rather unpopular. Instead, it seems like if Bangaloreans can add a H to their name, they do. This table shows the number of people in Bangalore with different spellings for Shwetha and Shruthi (now I’m using the dominant Bangalorean spellings).

As you can see, Shwetha and Shruthi are miles ahead of any of the alternate ways in which the names can be spelt. And this heavy usage of H can be attributed to the way Kannada incorporates both Sanskrit and Dravidian history.

Kannada has a pretty large vocabulary of consonants. Every consonant has both the aspirated and unaspirated version, and voiced and unvoiced. There are three different S sounds (compared to Tamil which has none) and two Ls. And we need a way to transliterate each of them when writing in English. And while capitalising letters in the middle of a word (as per Harvard Kyoto convention) is not common practice, standard transliteration tries to differentiate as much as possible.

And so, since aspirated Tha and Dha aren’t that common in Kannada (except in the “Tha-Tha” symbols used by non-Kannadigas to show raised eyes), th and dh are used for the dental letters. And since Sh exists (and in two forms), there is no reason to substitute it with S (unlike Tamil). And so we have H everywhere.

Now, lest you were to think that I’m using just two names (Shwetha and Shruthi) to make my point, I dug through the names dataset to see how often names with interchangeable T and Th, and names with interchangeable S and Sh, appear in the Bangalore dataset. Here is a sample of both:

There are 13002 Karthiks registered to vote in Bangalore, but only 213 Kartiks. There are a hundred times as many Lathas as Latas. Shobha is far more common than Sobha, and Chandrashekhar much more common than Chandrasekhar.
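For anyone who wants to repeat this kind of comparison, here is a minimal sketch in Python of one way to group names that differ only in these H's. The function names and the rule of collapsing “th”, “dh” and “sh” are assumptions made for illustration, not necessarily how the numbers above were produced.

```python
from collections import defaultdict


def h_insensitive_key(name):
    """Collapse th/dh/sh to t/d/s so that spelling variants share one key."""
    key = name.lower()
    for digraph, plain in (("th", "t"), ("dh", "d"), ("sh", "s")):
        key = key.replace(digraph, plain)
    return key


def group_variants(names):
    """Map each H-insensitive key to the set of spellings seen for it."""
    groups = defaultdict(set)
    for name in names:
        groups[h_insensitive_key(name)].add(name)
    return dict(groups)


# Spellings mentioned in this post; all four Shwetha variants share one key.
variants = group_variants(
    ["Shwetha", "Swetha", "Shweta", "Sweta", "Karthik", "Kartik"]
)
print(variants)

# Ratio of Karthiks to Kartiks, using the two counts quoted above.
print(round(13002 / 213))
```

This rule is deliberately crude: it would also merge names in which the H genuinely changes the sound, so any grouping it produces needs a manual check.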


So while other South Indians might conserve H, by not using them with S to compensate for using it with T and D, it doesn’t apply to Bangalore. Thinking about it, I wonder how a Kannadiga (Ravikiran) came up with this theory. Perhaps the fact that he has never lived in Karnataka explains it.

The Comeback of Lakshmi

A few months back I stumbled upon this dataset of all voters registered in Bangalore. One quick scraping script and a run later, I had the names, addresses and voter IDs of all voters registered to vote in Bangalore in the state assembly elections held this year.

As you can imagine, this is a fantastic dataset on which we can do the proverbial “gymnastics”. To start with, I’m using it to analyse names in the city, something like what Hariba did with Delhi names. I’ll start by looking at the most common names, and by age.

Now, extracting first names from a dataset of mostly South Indian names is tricky, since South Indians are quite likely to use initials, and to place them before their given names (for example, when in India, I most commonly write my name as “S Karthik”). I decided to treat all words of length 1 or 2 as initials (thus missing out on the “Om”s), and assume that the first word in the name of length 3 or greater is the given name (again ignoring those who put their family names first, or those that have expanded initials in the voter set).
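The heuristic described above can be sketched in a few lines of Python. The function name and the handling of full stops after initials are assumptions for illustration, not the script actually run on the voter data.

```python
def extract_given_name(full_name):
    """Return the first word of 3+ characters, treating shorter words as initials.

    Assumption: full stops after initials ("S. Karthik") are treated as spaces.
    Returns None when every word looks like an initial.
    """
    for word in full_name.replace(".", " ").split():
        if len(word) >= 3:
            return word
    return None


print(extract_given_name("S Karthik"))     # Karthik
print(extract_given_name("Y Ganesh Rao"))  # Ganesh
print(extract_given_name("Om Prakash"))    # Prakash (the "Om" is lost)
```

As the last call shows, the rule has exactly the blind spots admitted above: two-letter given names are dropped, and a family name written first would be mistaken for the given name.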

The most common male first name in Bangalore, not surprisingly, is Mohammed, borne by 1.5% of all male registered voters in the city. This is followed by Syed, Venkatesh, Ramesh and Suresh. You might be surprised that Manjunath doesn’t make the list. This is a quirk of the way I’ve analysed the data – I’ve taken spellings as given and not tried to group names by alternate spellings.

And as it happens, Manjunatha is in sixth place, while Manjunath is in 8th, and if we were to consider the two as the same name, they would comfortably outnumber the Mohammeds! So the “Uber driver Manjunath(a)” stereotype is fairly well-founded.

Coming to the women, the most common name is Lakshmi, with about 1.55% of all women registered to vote having that name. Lakshmi is closely followed by Manjula (1.5%), with Geetha, Lakshmamma and Jayamma coming some way behind (all less than 1%) but taking the next three spots.

Where it gets interesting is if we were to look at the most common first name by age – see these tables.

Among men, it’s interesting to note that in the younger age group (18-39, with the exception of 35) and the older age group (57+), Muslim names are the most common, while the intermediate range of 40-56 sees Hindu names such as Venkatesh and Ramesh dominating (if we assume Manjunath and Manjunatha are the same, the combined name comes top in the entire 26-42 age group).

I find the pattern of most common women’s names more interesting. It is interesting to note that the -amma suffix seems to have been done away with over the years (suffixes will be analysed in a separate post), with Lakshmamma turning into Lakshmi, for example.

It is also interesting to note that for a long period of time (women currently aged 30-43), Lakshmi went out of fashion, with Manjula taking over as the most common name! And then the trend reversed, as we see that the most common name among 24-29 year old women is Lakshmi again! And that seems to have gone out of fashion once again, with “modern names” such as Divya, Kavya and Pooja taking over! Check out these graphs to see the trends.

(I’ve assumed Manjunath and Manjunatha are the same for this graph)

So what explains Manjunath and Manjula being so incredibly popular in a certain age range, but quickly falling away on both sides? Maybe there was a lot of fog (manju) over Bangalore for a few years? 😛

The popularity of nicknames and political correctness

It is a rite of passage in an institution such as IIT (Indian Institute of Technology) that a first year student be given a potentially embarrassing nickname following “interaction” with senior students. The profundity of these nicknames varies significantly, with some people simply being given names that correspond to body parts in different languages, while others have more involved names.

Based on a conversation yesterday, the hypothesis is that more profound nicknames which are embarrassing only in a particular context are more likely to propagate, and thus stick, while the more crass names are likely to die out more easily.

The logic is simple – the crass names (a few examples being “lund”, “condom” and “dildo” – there is at least one person with each of these names in every hostel of every batch at IIT Madras) are potentially embarrassing for an “outsider” to use, and to be used in public. So when the bearer of such a name graduates and moves on to a new setting, the new people he encounters make a prudent choice to not use the embarrassing word, and the nickname dies a quick death.

When the nickname is embarrassing or derogatory for more contextual reasons, though, the name quickly loses its context and becomes incredibly simple for people to use. Take my own name “Wimpy”, for example – not too many people know it has an embarrassing origin, and it is a perfectly respectable word to shout out in public, or even in an office setting. And so it has propagated – in at least two offices I worked in, everyone called me “Wimpy”.

It is similar for lots of other “benign” names. But it is unlikely that a name like “condom” or “dildo” will propagate, and it is in fact more likely that even the people who bestowed such names upon the unsuspecting will stop using them once everyone graduates and moves on to a more formal environment.

There are exceptions, of course, a notable one being “Baada”. It is a cuss-word representing a body part, except that it is in a non-standard (though not small by any means) language, but everyone I know calls Baada Baada. He used to be my colleague, and people at work also called him Baada. It is unlikely that his nickname would’ve propagated, though, had it been the synonym in English or Hindi.

Thanks to Katpadi Katsa for discussions leading up to this post. In a future post, I’ll talk about models for propagation of nicknames across institutions.