How power(law)ful is your job?

A long time back I’d written about how different jobs are sigmoidal to different extents – the most fighter jobs, I’d argued, have linear curves – the amount you achieve is proportional to the amount of effort you put in. 

And similarly I’d argued that the studdest jobs have a near vertical line in the middle of the sigmoid – indicating the point when insight happens. 

However what I’d ignored while building that model was that different people can have different working styles – some work like Sri Lanka in 1996 – get off to a blazing start and finish most of the work in the first few days. 

Others work like Pakistan in 1992 – put NED for most of the time and then suddenly finish the job at the last minute. Assuming a sigmoid does injustice to both these strategies, since neither curve can easily be described using a sigmoidal function. 

So I revise my definition, and in order to do so, I use a concept from the 1992 World Cup – highest scoring overs. Basically take the amount of work you’ve done in each period of time (period can be an hour or day or week or whatever) and sort it in descending order. Take the cumulative sum. 

Now make a plot with the rank on the X axis and the cumulative sum on the Y axis. The curve will look like that of a Pareto (80-20) distribution. Now you can estimate the power law exponent, and curves that are steeper in the beginning (a greater amount of work done in fewer days) will have a lower power law exponent. 
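As a rough sketch of this computation (with entirely made-up numbers for a hypothetical ten-day project; fitting the slope of output against rank on log-log axes is only one crude way to get at the exponent):

```python
import math
from itertools import accumulate

# Hypothetical numbers: units of work done on each of ten days of a project.
work = [1, 0, 2, 15, 3, 0, 1, 25, 8, 5]

# Sort the periods by output, best first (the "highest scoring overs"),
# and take the cumulative sum -- this is the curve to plot.
ranked = sorted(work, reverse=True)
cumulative = list(accumulate(ranked))
print(cumulative)  # last element is the total work done

# A crude estimate of the power law exponent: a least-squares fit of
# log(output) against log(rank) over the non-zero periods. The steeper
# this decay, the more the work was concentrated in a few days.
xs = [math.log(rank) for rank, w in enumerate(ranked, start=1) if w > 0]
ys = [math.log(w) for w in ranked if w > 0]
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
exponent = -slope
print(round(exponent, 2))
```

Plot `cumulative` against rank to see the Pareto-like shape; the Sri Lanka 1996 worker's curve shoots up early and flattens, while the Pakistan 1992 worker's is flat and then jumps.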

And this power law exponent can tell you how stud or fighter the job is – the lower the exponent the more stud the job!! 

Newsletter!

So after much deliberation and procrastination, I’ve finally started a newsletter. I call it “the art of data science” and the title should be self-explanatory. It’s pure unbridled opinion (the kind of which usually goes on this blog), except that I only write about one topic there.

I intend to have three sections and then a “chart of the edition” (note how cleverly I’ve named this section to avoid giving much away on the frequency of the newsletter!). In this edition, though, I ended up putting in too much harikathe, so I restricted myself to two sections before the chart.

I intend to talk a bit each edition about some philosophical aspect of dealing with data (this section got a miss this time), a bit on data analysis methods (on which I went a bit meta this time) and a little bit on programming languages (which I used for bitching a bit).

And that I plan to put a “chart of the edition” means I need to read newspapers a lot more, since you are much more likely to find gems (in either direction) there than elsewhere. For the first edition, I picked off a good graph I’d seen on Twitter, and it’s about Hull City!

Anyway, enough of this meta-harikathe. You can read the first edition of the newsletter here. In case you want to get it in your inbox each week/fortnight/whenever I decide to write it, then subscribe here!

And put feedback (by email, not comments here) on what you think of the newsletter!

High dimension and low dimension data science

I’ve observed that there are two broad approaches that people take to getting information out of data. One approach is to simply throw a kitchen sink full of analytical techniques at the data. Without really trying to understand what the data looks like, and what the relationships may be, the analyst simply uses one method after another to try and get insight from the data. Along the way, a “model” will get built.

The other approach (which I’m partial to) involves understanding each variable, and relationship between variables as a first step to getting insight from the data. Here, too, a model might get built, but it will be conditional on the analyst’s view on what kind of a model might suit the data after looking at the data.

Considering that both these approaches are used by large numbers of analysts, it is highly likely that both are legitimate. Then what explains the fact that some analysts use one approach, and others use another? Having thought about it for a long time, I have a hypothesis – it depends on the kind of data being analysed. More precisely, it has to do with the dimensionality of the data.

The first approach (which one might classify as “machine learning”) works well when the data is of high dimensions – where the number of predictors that can be used for prediction is really large, of the order of thousands or larger. For example, even a seemingly low-resolution 32 by 32 pixel image, looked at as a data point, has 1024 dimensions (the colour of each of the 1024 pixels is a different dimension).

Moreover, in such situations, it is likely that the signal in the data doesn’t come from one, two, or a handful of predictors. In high dimension data science, the signal usually comes from a complex interplay of data along various dimensions. This kind of search is not something humans are fit for – it is best that the machines are left to “learn” the model by themselves, and so you get machine learning.

On the other hand, when the dimensionality of the dataset is low, it is possible (and “easy”) for an analyst to look at the interplay of factors in detail, and understand the data before going on to build the model. Doing so can help the analyst identify patterns in the data that may not be that apparent to a machine. It is also likely that in such datasets the signal lies along a small number of dimensions, where relatively simple manipulation will suffice. The low dimensionality also means that complex machine learning techniques are unlikely to contribute much in such cases.

As you might expect, from an organisational perspective, the solution is quite simple – deploy high-dimension data scientists on high-dimension problems, and likewise with low-dimension data scientists. Since this distinction between high-dimension and low-dimension data scientists isn’t very well known, it’s quite possible that data scientists might be made to work on problems of dimensionality outside their comfort zone.

So when you have low-dimension data scientists faced with a large number of dimensions of data, you will see them use brute force to try and find signals in bivariate relationships in the data – an approach that will never work since the signal lies in a more complex interplay of dimensions.

On the other hand, when you put high-dimension data scientists on a low-dimension problem, you will either see them missing out on associations that a human could easily find but a machine might miss, or you will see them unnecessarily “reduce the problem to a known problem” by generating and importing large amounts of data in order to turn it into a high dimension problem!

PS: I wanted to tweet this today but forgot. Basically, you use logistic regression when you think the signal is an “or” of conditions on some of the underlying variables. On the other hand, if you think the signal is more likely to be an “and” of such conditions, then you should use decision trees!
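A toy illustration of that intuition, with entirely made-up data (the 0.5 thresholds, the data sizes and the hard-coded tree are my own assumptions, not anything rigorous): when the label is an “and” of two threshold conditions, axis-aligned splits recover the rule exactly, while the single linear boundary that logistic regression draws cannot carve out the corner.

```python
import math
import random

random.seed(42)

# Made-up data: points in the unit square, labelled 1 only when BOTH
# coordinates exceed 0.5 -- an "and" of two conditions.
points = [(random.random(), random.random()) for _ in range(1000)]
labels = [1 if x > 0.5 and y > 0.5 else 0 for x, y in points]

# A depth-2 decision tree for this problem amounts to two axis-aligned
# splits; here I simply hard-code the splits such a tree would learn.
def tree_predict(x, y):
    return 1 if x > 0.5 and y > 0.5 else 0

# Logistic regression fit by plain batch gradient descent (no libraries).
w1 = w2 = b = 0.0
n = len(points)
for _ in range(300):
    g1 = g2 = gb = 0.0
    for (x, y), t in zip(points, labels):
        p = 1 / (1 + math.exp(-(w1 * x + w2 * y + b)))
        err = p - t
        g1 += err * x
        g2 += err * y
        gb += err
    w1 -= g1 / n
    w2 -= g2 / n
    b -= gb / n

def logit_predict(x, y):
    return 1 if w1 * x + w2 * y + b > 0 else 0

tree_acc = sum(tree_predict(x, y) == t
               for (x, y), t in zip(points, labels)) / n
logit_acc = sum(logit_predict(x, y) == t
                for (x, y), t in zip(points, labels)) / n
print(tree_acc, logit_acc)
```

The tree gets the corner region exactly right; the best any single line can do on an “and” of thresholds is clip that corner, so logistic regression tops out below perfect accuracy here. This is only an intuition pump, of course, not a claim about the two methods in general.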

 

Tiered equity structure and investor conflict

About this time last year, I’d written this article for Mint about optionality in startup valuations. The basic idea there was that any venture capital investment into startups usually comes with “dirty terms” that seek to protect the investor’s capital.

So you have liquidity preferences that demand that the external investors get paid out first (according to a pre-decided formula) in case of a “liquidity event” (such as an IPO or an acquisition). You also have “ratchets”, which seek to protect an investor’s share in the company in case the company raises a subsequent round at a lower valuation.

These “dirty terms” are nothing but put options written by existing investors in a firm in favour of the new investors. And these options telescope. So the Series A round has options written by founders, employees and seed investors, in favour of Series A investors. At the time of Series B, Series A investors move to the short (writing) side of the options, which are written in favour of Series B investors. And so forth.
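To make the option analogy concrete, here is a stylised payout calculation – hypothetical numbers and a plain 1x non-participating liquidation preference; real term sheets vary considerably:

```python
# Hypothetical deal: an investor puts in $10m for a 20% stake,
# with a 1x non-participating liquidation preference.
invested = 10.0  # $m
stake = 0.20

def investor_payout(exit_value):
    """At a liquidity event the investor takes the better of the
    preference (capital back) or straight conversion to equity --
    which is exactly the payoff of equity plus a put option."""
    return max(invested, stake * exit_value)

for exit_value in (30.0, 50.0, 100.0):
    print(exit_value, investor_payout(exit_value))
```

Below a $50m exit the preference (the put) is in the money and the earlier shareholders absorb the shortfall; above it, the investor converts and the option expires worthless.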

There are many reasons such clauses exist. One venture capitalist told me that his investors have similar optionality on their investments in his funds, and it is only fair that he passes it on. Another told me that “good entrepreneurs” believe in their idea so much that they don’t want to even consider the thought that their company may not do well – which is when these options pay out – and so they are happy to write these options. And then you know that an embedded option can inflate the “headline valuation” of a company, which is something some founders want.

In any case, in my piece for Mint I’d written about such optionality leading to potential conflicts among investors in different classes of stock, which might sometimes be a hindrance to further capital raises. Quoting from there,

The latest round of investors usually don’t mind a “down round” (an investment round that values the company lower than the preceding round) since their ratchets protect them, but earlier investors are short such ratchets, and don’t want to see their stakes diluted. Thus, when a company is unable to find investors who are willing to meet its current round of valuation, it can lead to conflict between different sets of investors in the company itself.

And now Mint reports that such conflicts are a main reason for Indian e-commerce biggie Snapdeal’s recent struggles, which have led to massive layoffs and a delay in funding. The story has played out exactly as I’d written in the paper last year.

Softbank, which invested last in Snapdeal and is long put options on the company’s value, is pushing the company to raise more funds at a lower valuation. However, Nexus and Kalaari, who had invested earlier and stand to lose significantly thanks to these options, are resisting such moves. And the company continues to stall.

I hope this story provides entrepreneurs and venture capitalists sufficient evidence that dirty terms can affect everyone up and down the chain, and can actually harm the business’s day-to-day operations. The cleaner a company keeps the liabilities side of the balance sheet (in having a small number of classes of equity), the better it is in the long run.

But then with Snap having IPO’d by offering only non-voting shares to the public, I’m not too hopeful of equity truly being equitable any more!

Explaining the lack of dishwashers in India

For the last four weeks, after landing in Britain, we’ve been using the dishwasher fairly regularly. On average, we run it once a day, and the vessels come out of it nice and shiny – to an extent that is nearly impossible when you wash them by hand. Last year when we were in Spain, too, we used the dishwasher fairly often.

Considering the convenience (all your dishes done in one go, and coming out nice and shiny), I’ve been wondering why the dishwasher hasn’t taken off in India. The requirement for water and electricity doesn’t explain it – the near-ubiquity of the washing machine in upper middle class households suggests that is not that much of a problem. It’s not a function of our using steel plates, either – if that were the only constraint, people would have switched plates to get the benefit of this convenience.

The real answer lies in the archaic concept of the enjil (saliva; known as jooTa in Hindi), and theories on how saliva can get transmitted and contaminate stuff. To be fair, it’s a useful concept in that it doesn’t allow anyone’s germ-bearing saliva to contaminate things around them – except for roads and sidewalks, that is! Specifically, the enjil concept ensures that food doesn’t get even remotely contaminated by someone’s saliva. But it takes things a bit too far.

For example, sharing plates, even when you’re using separate spoons (let’s say when sharing dessert at a restaurant), is taboo. When you double-dip your spoon into the plate, germs from your saliva get transmitted there, and can potentially contaminate people you are sharing your food with. Or so the theory goes. The exceptions are in childhood, where a child is allowed to share plates with the mother, and after marriage, when couples are allowed to share plates! Go figure how that works.

Similarly, traditional Indians eschew the dining table, and the concept of keeping serving bowls on the same surface as plates. Again, the concept is that saliva can somehow “transmit” from the plates to the serving bowls and contaminate everyone’s food.

Next, there is an elaborate protocol to deal with used plates. They are not supposed to be washed in the same sink as other vessels. Yes, you read that right. When I was growing up, the protocol for used plates was to first rinse them in the bathroom (after throwing leftover food in the dustbin) before dropping them in the sink. It didn’t matter how well you rinsed the plate in the bathroom – the fact that water had fallen on it after your use indicated that it was now purified, and fit to sit with all the other unwashed vessels.

Now consider the dishwasher. To achieve economies of scale at the household level, and to ensure vessels don’t pile up, you put all kinds of vessels in it at the same time – plates, spoons, forks, serving bowls and cooking vessels! In other words, “saliva-bearing” dishes are put into the same contraption at the same time as “saliva-free” cooking dishes, and the “same water” is used to wash all of them together.

And that clearly violates all prudent practices of saliva management and contamination avoidance that we have all grown up with! And trust me, it takes time to get over such instinctive practices one has grown up with. And so I predict that it will be at least another generation (20 years or so) before there are sufficient households with adults who grew up without a strong concept of enjil, and who might be willing to give the dishwasher a try!

Mata Amrita Index needs a new dimension

Some of the hugs look too flimsy for a 10-year reunion
Pinky

As anyone here who has tried to construct an index will know, any index, however well constructed, will end up being way too simplistic, and abstract away way too much information. This is especially true of indices that are constructed as weighted averages of different quantities, but even indices with more “fundamental” formulae are not immune to this effect.

Some eight years ago, I constructed an index called the “Mata Amrita Index”, which my good friend Sangeet describes as the “best ever probabilistic measure” he’s come across. It’s exactly that – a probabilistic measure.

Quoting from the blog post where I introduced the concept:

The Mata Amrita Index for a person is defined as the likelihood of him or her hugging the next random person he/she meets.

Actually over time I’ve come to prefer what I’d called the “bilateral MAI”, which is the probability that a given pair of people will hug each other the next time they meet. The metric has proved more useful than I had initially imagined, and has in a way helped me track how some friendships are going. So far so good.

But it has a major shortcoming – it utterly fails to capture quality. There are some people, for example, who I don’t hug every time I meet them, but on the random occasions when we do hug, it turns out to be incredibly affectionate and warm. And there are some other people, with whom my bilateral MAI tends to 1, but where the hug is more of a ritual than a genuine expression of affection. We hug every time, but the impact of the hug on how I feel is negligible.

In fact, I’d written about this a couple of years back – that when the MAI becomes too high, the quality and the impact of the hug inevitably suffer. Apart from the ritual nature of the hug robbing it of warmth, a high MAI also results in a lack of information flow – you know you hug as a rule, so the hug conveys no information.

So, now I want to extend the MAI (all good index builders do this – try to extend it when they realise its inadequacies) to incorporate quality as well. And like any index extension, the problem is to be able to achieve this without making the index too unwieldy. Right now, the index is a probabilistic measure, but not that hard to understand. It’s also easy to adjust your bilateral MAI with someone every time you meet.

How do you think I can suitably modify the MAI to bring in the quality aspect? One measure I can think of is “what proportion of the time when you meet do you hug, and it makes you feel real good?”. As you can see it’s already complicated, but this brings in the quality component. The ritual hug with the high MAI counterparty makes no impact on you, so your modified MAI with that person will be low.

The problem with this Modified MAI (MMAI) is that it is automatically capped by the MAI, given the “AND” condition in its definition. So a person you hug infrequently, but feel incredibly good after each such hug, will have a low MMAI with you – it’s more to do with the low frequency of hugging than the quality.
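The cap is easy to see with a toy meeting log (hypothetical counts, purely for illustration): since a good hug requires a hug in the first place, the probability of “hug AND feel good” can never exceed the probability of “hug”.

```python
# Hypothetical log of 20 meetings with one counterparty, as pairs of
# (hugged?, felt_good?) -- felt_good can only be True when hugged is.
meetings = ([(True, True)] * 3      # warm hugs
            + [(True, False)] * 2   # ritual hugs
            + [(False, False)] * 15)  # no hug

# Bilateral MAI: empirical probability of hugging at a meeting.
mai = sum(hugged for hugged, _ in meetings) / len(meetings)

# Modified MAI: probability of hugging AND feeling good -- necessarily
# no larger than the MAI itself.
mmai = sum(hugged and good for hugged, good in meetings) / len(meetings)
print(mai, mmai)  # 0.25 0.15
```

An infrequent-but-warm hugger ends up with a low MMAI purely because of the low frequency of hugging, which is exactly the shortcoming described above.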

If you can think of a more elegant measure, do let me know! Whoever said building an index is a simple process!

Letters to my Berry #6

So you turn half today. And I’ll let you write this letter yourself, since over the last week or two you’ve been trying to get to the computer and operate it.

So long, bangalore. We're off to wash our arses in the Thames.


OK I gave you two minutes, but unlike what you’ve been doing over the last one week, you didn’t try to get to the computer today, so I’ll write this myself.

This last month has been one of big change, as you made your first forin trip. It was a mostly peaceful flight from Bangalore to London, via Dubai. You hardly cried, though you kept screaming in excitement through the flight, and through the layover in Dubai. Whenever someone smiled at you, you’d attempt to talk to them. And it would get a bit embarrassing at times!

Anyway, we got to London, and we had to put you in day care. The first day when I left you at the day care for a one-hour settling in session, I cried. Amma was fine, but I had tears in my eyes, and I don’t know why! And after two settling-in sessions, you started “real day care”, and on the first day it seems you were rather upset, and refused to eat.

So I had to bring you home midway through the day and feed you Cerelac. It was a similar story on the second day – you weren’t upset, but you still wouldn’t eat, so I had to get you home and give you Cerelac. It was only on the third day, that is today, that you finally ate at the nursery!

The biggest challenge for us after bringing you to London has been to keep you warm, since you refuse to get the concept of warm clothes, and refuse to wear them. And so for the last 10 days you’ve not only got a cold and cough for yourself, but you’ve also transmitted it to both Amma and me 🙁

London has also meant that you’ve started travelling by pram regularly, though after one attempt we stopped taking it on the Underground since it was difficult to negotiate steps. When we have to take the train, I thus carry you in your baby carrier, like a baby Kangaroo!

In the last one month, you’ve also made significant motor improvements. You still can’t sit, but you try to stand now! It seems like you’ve taken after Amma and me in terms of wanting to take the easy way out – you want to stand without working hard for it, and sometimes scream until we hold you up in a standing position.

Your babbles have also increased this month, and we think you said “Appa” a few times in the course of the last one week! Maybe I just like to imagine that you say it, or maybe you actually call me that! It’s too early to say!

Finally, one note of disappointment – on Monday when you were all crying and upset and refusing to eat at the nursery, I rushed to pick you up, and hoped to see you be very happy when you saw me. As it turned out, you gave me a “K dear, you are here” kind of expression and just came home! Yesterday you actually cried when I came to pick you up!

It seems like you’re becoming a teenager already! And you’re just half!!