Python and Hindi

So I’ve recently discovered that using Python to analyse data is, to me, like talking in Hindi. Let me explain.

Back in 2008-9 I lived in Delhi, where the only language spoken was Hindi. Now, while I’ve learnt Hindi formally in school (I got 90 out of 100 in my 10th boards!), and watched plenty of Hindi movies, I’ve never been particularly fluent in the language.

The basic problem is that I don’t know the language well enough to think in it. So when I’m talking Hindi, I usually think in Kannada and then translate my thoughts. This means my speech is slow – even Atal Behari Vajpayee can speak Hindi faster than me.

More importantly, thinking in Kannada and translating means that I can get several idioms wrong (can’t think of particular examples now). And I end up using the language in ways that native speakers don’t (again can’t think of examples here).

I recently realised it’s the same with programming languages. For some 7 years now I’ve mostly used R for data analysis, and have grown super comfortable with it. However, at work nowadays I’m required to use Python for my analysis, to ensure consistency with the rest of the firm.

While I’ve grown reasonably comfortable with using Python over the last few months, I realise that I have the same Hindi problem. I simply can’t think in Python. Any analysis I need to do, I think about it in R terms, and then mentally translate the code before performing it in Python.

This results in several inefficiencies. Firstly, the two languages are constructed differently and optimised for different things. When I think in one language and mentally translate the code to the other, I’m exploiting the efficiencies of the thinking language rather than the efficiencies of the coding language.

Then, the translation process itself can be ugly. What might be one line of code in R can sometimes take 15 lines in Python (and vice versa). So I end up writing insanely verbose code that is hard to read.

Such code also looks ugly – a “native user” of the language will find it rather funnily written, and hard to follow.
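A minimal sketch of what this can look like, using invented data and hypothetical function names rather than anything from the actual analysis described here. Both functions compute a per-group average; the first reads like a line-by-line mental translation, while the second leans on what Python itself offers.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: (city, sales) pairs, invented purely for illustration.
records = [("Delhi", 10), ("Bangalore", 20), ("Delhi", 30), ("Bangalore", 40)]


def group_means_translated(rows):
    """Index-heavy version: works, but is verbose and hard to read."""
    keys = []
    totals = []
    counts = []
    for i in range(len(rows)):
        key = rows[i][0]
        value = rows[i][1]
        if key not in keys:
            keys.append(key)
            totals.append(0)
            counts.append(0)
        j = keys.index(key)
        totals[j] = totals[j] + value
        counts[j] = counts[j] + 1
    result = {}
    for j in range(len(keys)):
        result[keys[j]] = totals[j] / counts[j]
    return result


def group_means_idiomatic(rows):
    """Same result, written the way a 'native speaker' of Python might."""
    groups = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: mean(values) for key, values in groups.items()}


print(group_means_translated(records))
print(group_means_idiomatic(records))
```

Both versions give the same averages; the difference is only in how much a reader has to wade through to see that.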

A decade ago, after a year of struggling in Delhi, I packed my bags and moved back to Bangalore, where I could both think and speak in Kannada. Wonder what this implies in a programming context!

Censoring the death ceremony

So we finally watched Raam Reddy’s much-acclaimed Thithi today. Ever since we’d watched the trailer, we’d wanted to see the movie, and though reviews from relatives and friends were mixed, they helped set our expectations and we had a good time at the movie.

This post, however, is not about the movie, but about censorship. We watched at PVR Forum, and immediately after the U/A certificate (and before the movie) came a certificate with the cuts that the censor board had recommended. Even before the movie began, we knew that four instances of thika (arse) and one instance of bOLi (bitch) had been muted.

I think this is a fantastic idea – while the censor board is happy to use its scissors liberally, showing how they’ve used their scissors beforehand helps set viewers’ expectations, so that they know exactly what they’ve missed out. My only contention is that that slide should be shown for longer than it was, so that viewers get a better idea.

Anyway, once the movie started, it was clear that the censors had done a shoddy job. As a friend (who watched the movie yesterday) pointed out, the word “tuNNe” (dick) wasn’t muted out. I also noticed during the movie that a dialogue translated (and subtitled) as “screw your mother” had been left in.

(While I initially wondered why a Kannada movie was being shown in Bangalore with English subtitles, I realised once the movie started that it was a good thing. The language used in the movie was quite different from what we normally speak in Bangalore.)

What the censorship of words in this movie goes to illustrate is that the censor board is thoroughly incompetent. Whether censorship is necessary is a philosophical question, and the government has appointed a committee to look into that. What is more important is that the people at the censor board are thoroughly incompetent, and hopefully that will be taken into account when the censorship policy is finally revised!

thika is something every Kannadiga kid uses liberally (though bOLi is something we graduate to only in teens), while tuNNe and nin-amman (translated as “screw your mother”) are normally not used in polite conversation. The censor board is absolutely clueless!

Gandhi

I was playing table tennis in my hostel at IIT with a friend who came from North India. At some point during a rally, the ball hit the edge of the table on his side, and moved far away, giving me the point. I apologised (as you normally do when you win a point by fluke), and said “Gandhi”. He didn’t understand what that meant.

It was then that I realised that using the word “Gandhi” as a euphemism for “fluke” is mostly a Bangalore thing. Back when I played table tennis during my school days, a let was called “Gandhi”, as was a ball hitting the edge of the table. It was the same case with comparable sports such as badminton or tennis or even volleyball. A basket that went in by fluke in basketball was also “Gandhi”.

Now, it might be hard for people to reconcile flukes with MK Gandhi, who was assassinated sixty eight years ago. Some people might also find it repugnant – that the great Mahatma’s name might be used to describe flukes. Looking at it as a fluke, however, is a shallow interpretation.

While it is hard to compare Gandhi (the person) with flukes, it is not hard at all to look at him as a figure of benevolence. He was known for his non-violent methods, and for turning the proverbial “other cheek”. He pioneered the use of non-cooperation as a method of protest (which has unfortunately far outlived its utility) and showed that you could win by being extremely nice. This was channelled in a movie a decade ago which spoke about “Gandhigiri” as a strategy for world domination.

So when the table tennis ball hits the edge of the table and flies off, invoking Gandhi’s name is a sign of benevolence by the person who has lost the point, who implicitly says “you, bugger, didn’t deserve to win this point. But I’ll be benevolent like Gandhi and allow you to take it”. It is similar in other sporting contexts, such as a let or a freak basket.

The invocation of Gandhi’s name as a sign of benevolence is common in other fields as well. In 1991, my cousin had to miss her second standard annual exams as she had to fly to Bangalore on account of the death of the grandfather we shared. Her school, in an act of benevolence, promoted her anyway, an act that was described by other relatives in Bangalore as “Gandhi pass”.

If there is a Gandhi pass, there is a Gandhi class also (again I was surprised to know it’s not a thing in North India). Another of Gandhi’s defining characteristics was the simplicity of his life. Though he could afford to travel better, he would always travel third class, which had the cheapest ticket. As a consequence, the cheapest ticket came to be known as the “Gandhi class”.

The term (Gandhi class) is now most commonly used in the context of cinemas, referring to the front few rows for which tickets are the cheapest. Even though multiplexes have larger blocks nowadays, which means front row tickets are no cheaper than those a few rows behind, the nomenclature sticks. If you are unlucky enough to only get a seat in the first couple of rows, you proudly say you are in “Gandhi class”.

That his name has come to be associated with so many everyday occurrences, mostly in irreverence, illustrates the impact Gandhi has had. Some people might outrage (as the fashion is nowadays) about the irreverence, and “reduction” of Gandhi to these concepts.

I’m still surprised, though, that things like “Gandhi class”, “Gandhi pass” and “Gandhi” as a euphemism for fluke weren’t that prevalent in North India fifteen years ago.

English and phonetic spellings

So my nephew Samvit, who recently turned 4, has learnt to spell. And he has learnt to use a computer (and phone) keyboard. He seems to love the keyboard so much that he apparently refuses to write using pen and paper.

They say that he’s taken after me in many ways (despite us sharing just 1/16 of our genes – he’s my cousin’s son), and I must mention that my writing output exploded after I had learnt to type and got access to a computer keyboard.

The point of this post is not about his writing, however. Yesterday, they made him spell out a few words, and here is how he spelt them out. One thing I might want to disown him for is that he uses all caps. Leaving that aside, the way he spells is extremely interesting. Here is the list of words he spelt out, as emailed to me by his mother:

THIORI
ELEKTRIC
MAGNET
ANTENA
MYKRO
STRIP
HELIX
PERABOLA
DYPOLE
HORN
GYD
COSMIK
PLANET
ANIMUL
DANS
SING
CUK
DRAMA
MUZIC
HOUS
TEMPUL
SOUND
SOFA
WATUR
AEROPLEN
SHIPYARD
GARDUN
CHOKLET
BRED
JUS
BANANA
ORENJ
AVACADO
ORIYO

As you might notice, it’s all very phonetic. He has learnt the English alphabet, and sounds associated with each letter, and then tried to fit that to the words that he has had to type out. It appears weird at first, but then if you take a closer look, you realise that it’s rather intuitive.

He seems to have figured out the polymorphism behind certain letters, for he uses multiple sounds of U in “Jus” (which is how I think it’s spelt in certain European languages, btw) and in “Gardun”. He hasn’t figured out the polymorphism in i-y though, as he says “thiori” and “gyd”.

Then his use of Cs and Ks for the Ka sound is also interesting, as he uses both of them, and he seems to have a certain logic for using them. I’ve been trying to reverse engineer this logic but have so far failed. He says “cuk” and “cosmik”, from which you might think he uses “c” when it’s the beginning of a syllable and “k” when it’s the end of a syllable.

But then you also notice that he says “elektric” which throws this hypothesis out of the window. And there is “avacado” and “choklet”.
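One way to check the hypothesis is to tabulate where each C and K falls. The sketch below is only an approximation: it treats the start and end of the whole word as a crude stand-in for syllable boundaries (an assumption, since splitting words into syllables properly is harder), and the function name is made up for illustration.

```python
# The words from the list above that contain a C or a K.
words = ["ELEKTRIC", "MYKRO", "COSMIK", "CUK", "MUZIC", "AVACADO", "CHOKLET"]


def k_sound_letters(word):
    """Return (letter, position) pairs for each C or K used for the 'k' sound.

    Position is 'start', 'end' or 'middle' of the word.
    """
    found = []
    for i, ch in enumerate(word):
        if ch not in "CK":
            continue
        if ch == "C" and word[i + 1:i + 2] == "H":
            continue  # "CH" as in CHOKLET is a different sound
        if i == 0:
            pos = "start"
        elif i == len(word) - 1:
            pos = "end"
        else:
            pos = "middle"
        found.append((ch, pos))
    return found


# Hypothesis: C at the start, K at the end. Collect words that break it.
counterexamples = [
    w for w in words
    if ("C", "end") in k_sound_letters(w) or ("K", "start") in k_sound_letters(w)
]

for w in words:
    print(w, k_sound_letters(w))
print("Breaks the rule:", counterexamples)
```

On this crude check, both “elektric” and “muzic” end in C, which goes against the “k at the end” rule. The letters in the middle of words (as in “mykro” or “avacado”) can’t be judged without real syllable boundaries.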

Overall, though, it is fascinating to see how a four-year-old who has just learnt the language spells. Maybe if we get a bunch of four-year-olds who haven’t yet been formally taught spelling to spell out words, we might understand what English spelling should intuitively be like! It might even be possible that going forward the language may evolve to this new spelling!

Are there any other interesting patterns you notice in the list of words? Are there any other interesting ways in which you’ve seen other kids spell? What does this mean for the English language – should it be simplified?

Ghoti

Days of the week in Bahasa

Most languages name their days of the week after a single source, and this is usually consistent across languages. For example, the original Latin names for the days of the week came from “planets” – Sun, Moon, Mars, Mercury, Jupiter, Venus and Saturn respectively. And this got copied into various languages.

So the days of the week as we know in English are derived from the names of these planets or Gods representing them (Thor giving Thursday and so on). Indian names for the days of the week are direct translations of the Latin names. And some days have multiple names in Indian languages, all of which mean the same thing.

So you have Ravivara and Bhaanuvara and Adityavara, all of which refer to Sunday, and all of which precisely translate to “Sun day”. The more formal name for Thursday is “bRhaspativara”, but it is more commonly referred to as “Guruvara”, with “Guru” being the more common name for bRhaspati. And so forth.

Based on this background, I found the names of the days of the week in Bahasa Indonesia, which I observed from signboards (Bahasa uses the Roman script, so one level of Rosetta stoning can happen from signboards), rather interesting.

The names are (starting with Sunday):
Minggu
Senin
Selasa
Rabu
Kamis
Jumat
Sabtu

Ok I got that from this link as I was writing, but what I got from signboards yesterday was the names of Friday, Saturday and Sunday (Jumat, Sabtu and Minggu respectively). And I found it fascinating since it seems like they come from multiple sources.

So Jumat, it appears, is the day of prayer, or Juma. Considering that Indonesia is a Muslim-majority country (it’s not funny how empty restaurants are during lunch nowadays, since it’s Ramzan), naming Friday as “the day of prayer”, using the Muslim word for prayer, is absolutely logical.

Sabtu for Saturday is obviously derived from “Sabbath” – another day of prayer but for a different religion (Judaism). It looks like it’s derived from European names for Saturday – Saturday in Spanish is Sabado, for instance. So actually, in this case we are seeing a wider adoption of naming the day of week after its religious significance than the associated planet.

And Minggu, it appears, is a diminutive of Domingo, the Spanish and Portuguese word for Sunday (and perhaps there are similar names in other European languages). And it appears that “Domingo” has nothing to do with the Sun, but instead is derived from Latin for “God’s day” (since Sunday is the day of the Christian God, who famously took rest on that day).

So it’s interesting that Bahasa has names for three days of the week which are not based on the planets, but on different versions of “God’s day”, with multiple origins among them! Or rather, that Bahasa has three “God’s day”s, with each referring to a different god.

I’m reminded of this store that existed a long time back close to where I currently live. It was called “yellAdEvarakRpe stores” (store with the grace of all gods).

Languages as memes

A while back on this blog I had compared religious and cultural practices to memes (in the original Richard Dawkins sense of the word). Back then I had written:

So if you were to look at it in terms of responsibility to society, you need to propagate only those cultural traits that you deem to be relevant and important. “So what if everyone stops celebrating Ganesh Chaturthi?” you may ask. If that would happen that would simply mean a vote of no confidence for the festival and an indication that the festival needs to be phased out. If everyone were to propagate only those cultural traits they find useful, traits that a significant proportion of society finds significant will continue to survive and thrive. For Ganesh Chaturthi to exist 30 years hence, it isn’t necessary for ALL families that have inherited it to celebrate it now. As long as a critical mass of families celebrate it, the festival will survive. If not, it probably doesn’t need to exist.

Now, thinking about it, you can consider language to also be a meme. When a bunch of you find that there is a concept for which the language you speak in has no word, you invent a word and add it to the language (this is like a genetic mutation). If enough people like this mutation (i.e. if it is “fit”) it will propagate, and soon become part of the language.

If there is a word in the language that is archaic and not useful for describing any of the phenomena that you are likely to encounter, you stop using it. When people stop using such words, they become “archaic” (ok I see circular reasoning in this paragraph) and effectively drop out of the language. Thus, a living language is always dynamic, receptive to new words (to describe concepts that earlier didn’t need description) and receptive to discarding words that are not useful any more. Thus, the feature that defines a living language is dynamism and change.

This has several policy implications.

1. The concept of “purity” of language is wrong. Some people want to speak in the “pure form” of a language. As long as it is a language that has been truly alive (and not kept alive mostly by ancient literature) there exists no “pure form”, for the definition of a successful language involves frequent “mutations”. So if you ask me to talk in “pure Kannada” it is nonsense. Pure Sanskrit, on the other hand, has some meaning, for the language has been so little used that it’s stopped evolving and mutating.

2. People like to appoint themselves guardians of culture and dictate top-down what words should be part of a particular language. For example, there exists a body under the Government of Karnataka (if I’m not wrong) which dictates what “Kannada words” must be used for different new concepts. This is wrong, and a recipe for such words not being used.

Instead, “memetics” must be respected and evolution must be bottom up. People find the need to describe phenomena around themselves and if they don’t find a word in their language that describes it, they will either invent or borrow such a word. Some such new words become widely used, at which point they can be introduced into the language dictionary. Usage should precede presence in the dictionary, not the other way round.

3. “Slang” is a part of language, and a leading indicator of how the language is going to evolve. It should be encouraged and not denounced, for it exists because the language as it stands now cannot describe certain concepts effectively enough.

I’m currently reading this book called The Information by James Gleick, which has a chapter or two dedicated to languages and dictionaries. It was while reading it that I realised how languages are memes.

Switching languages

I used to marvel about how whenever I was in the company of other people from IIT Madras, I would instinctively switch to speaking “IITese”. Words such as “slisha”, “peace”, “rod”, and all others that I would not normally use in normal English when speaking to normal people would suddenly appear in my vocabulary while talking to others from IITM.

I used to consider myself special that I could discriminate thus, and make best use of the languages I know while not discriminating against people who didn’t understand one of the languages, such as IITese. I used to consider this great, but this bubble got broken when my nephew started talking.

This guy is half-Kannadiga, half-Marathi, with a Gult nanny and his parents speak to each other in Hindi. He is now three years old and for over a year now he’s been very comfortable speaking Kannada and Marathi, and to an extent Telugu, Hindi and English (which he’s learning in school)! The most remarkable thing with him, though, (as with all other multilingual kids, I would imagine) is that he has mapped people to languages. For example, he knows that I speak Kannada and he speaks to me only in Kannada. And while talking to me if his father (who is Marathi) is present, he immediately switches to Marathi to talk to him. Across languages that are very different, he is able to switch easily and seamlessly and moreover know who speaks which language!

There is a downside, though. Once when his mother, who is “supposed to speak to him in Kannada”, tried talking to him in Marathi, he got really angry and wild and asked her to speak in Kannada! Our initial thought was he was being finicky, but I now think it is to do with parsing. When his mother speaks, he has his “Kannada parser” switched on, and if she speaks Marathi, there is a parsing error and it causes great stress on his processor to switch languages. And being a small kid, that makes him cranky and wild!

In other words, this can be considered as another case of Bayesian recognition! It seems like the human mind’s parsing of speech is influenced by the prior distribution of what language the speaker is speaking in. As the first few words come out, we firm up which parser to use, and then it is smooth sailing. For a kid, though, it seems like the prior distribution of parsers is “binary” (one 1, and the rest 0s), which is what makes the wrong speaker wrong language combo annoying for them!
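A toy sketch of this idea, with all the numbers assumed purely for illustration (nothing here is measured, and the function name is made up). Each “parser” gets a prior probability, and every word heard updates those probabilities by Bayes’ rule. An adult with a strong but not absolute prior recovers within a few words; a “binary” prior can never move, because zero times anything stays zero.

```python
def update(prior, likelihood):
    """One Bayes step: posterior is proportional to prior times likelihood."""
    unnormalised = {lang: prior[lang] * likelihood[lang] for lang in prior}
    total = sum(unnormalised.values())
    return {lang: p / total for lang, p in unnormalised.items()}


# Assumed numbers: how plausible a heard word is under each parser when the
# speaker is actually talking in Marathi.
likelihood = {"Kannada": 0.05, "Marathi": 0.9}

adult_prior = {"Kannada": 0.9, "Marathi": 0.1}  # strong, but not absolute
child_prior = {"Kannada": 1.0, "Marathi": 0.0}  # "binary": one 1, rest 0s

adult, child = adult_prior, child_prior
for _ in range(3):  # three Marathi words in a row
    adult = update(adult, likelihood)
    child = update(child, likelihood)

print("adult:", adult)
print("child:", child)
```

After three words the adult’s belief has shifted almost entirely to Marathi, while the child’s is still stuck on Kannada, which is one way to picture the “parsing error”.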

Us human beings are smarter than we think!