How power(law)ful is your job?

A long time back I’d written about how different jobs are sigmoidal to different extents – the most fighter jobs, I’d argued, have linear curves – the amount you achieve is proportional to the amount of effort you put in. 

And similarly I’d argued that the studdest jobs have a near vertical line in the middle of the sigmoid – indicating the point when insight happens. 

However what I’d ignored while building that model was that different people can have different working styles – some work like Sri Lanka in 1996 – get off to a blazing start and finish most of the work in the first few days. 

Others work like Pakistan in 1992 – put ned for most of the time and then suddenly finish the job at the last minute. Assuming a sigmoid does an injustice to both these strategies, since neither of these curves can easily be described using a sigmoidal function. 

So I revise my definition, and in order to do so, I use a concept from the 1992 World Cup – highest scoring overs. Basically take the amount of work you’ve done in each period of time (period can be an hour or day or week or whatever) and sort it in descending order. Take the cumulative sum. 

Now make a plot with the index (the rank of the period after sorting) on the X axis and the cumulative sum on the Y axis. The curve will look like that of a Pareto (80-20) distribution. Now you can estimate the power law exponent, and curves that are steeper in the beginning (greater amount of work done in fewer days) will have a lower power law exponent. 
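
The recipe above can be sketched in a few lines of Python. This is only a rough sketch, and one of several ways to do the estimate: it assumes that the cumulative share of work y done in the top fraction x of periods behaves like y = x^a, and fits a by least squares on the logs. The function names and the two sample work logs are made up for illustration.

```python
import math

def concentration_curve(work_per_period):
    """Sort per-period output in descending order and return cumulative shares."""
    ordered = sorted(work_per_period, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("need some positive amount of work")
    shares, running = [], 0.0
    for w in ordered:
        running += w
        shares.append(running / total)
    return shares

def fit_exponent(shares):
    """Fit a in y = x**a, where x is the fraction of periods and y the cumulative share.

    Assumption: a straight line through the origin in log-log space, since the
    curve must pass through (1, 1).
    """
    n = len(shares)
    xs = [math.log((i + 1) / n) for i in range(n)]
    ys = [math.log(s) for s in shares]
    den = sum(x * x for x in xs)
    if den == 0:
        raise ValueError("need at least two periods")
    return sum(x * y for x, y in zip(xs, ys)) / den

steady = [5] * 10                          # same output every period
bursty = [90, 2, 2, 2, 1, 1, 1, 1, 0, 0]   # nearly everything done in one period

print(fit_exponent(concentration_curve(steady)))
print(fit_exponent(concentration_curve(bursty)))
```

Under this assumed form, perfectly even work gives an exponent of 1 (a straight line), and the bursty log gives an exponent well below 1.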

And this power law exponent can tell you how stud or fighter the job is – the lower the exponent the more stud the job!! 

Gossip Propagation Models

More than ten years ago, back when I was at IIT Madras, I considered myself to be a clearinghouse of gossip. Every evening after dinner I would walk across to Sri Gurunath Patisserie, and plonk myself at one of the tables there with a Rs. 5 Nescafe instant coffee. And there I would meet people. Sometimes we would discuss ideas (while these discussions were rare, they were most fulfilling). Other times we would discuss events. Most of the time, and in conversations that would be entertaining if not fulfilling, we discussed people.

Constant participation in such discussions made sure that any gossip generated anywhere on campus would reach me, and to fill time in subsequent similar conversations I would propagate it. I soon got to know random details about random people on campus whom I hardly cared about. Such information was important purely because someone else might find it interesting. Apart from the joy of learning such gossip, however, I didn’t get remunerated for my services as clearinghouse.

I was thinking about this topic earlier today while reading this studmax post that the wife has written about gossip distribution models. In it she writes:

This confirmed my earlier hypothesis that gossip follows a power law distribution – very few people hold all the enormous hoards of information while the large majority of people have almost negligible information. Gossip primarily follows a hub and spoke model (eg. when someone shares inappropriate pictures of others on a whatsapp group) and in some rare cases especially in private circles (best friends, etc.), it’s point to point.

 

For starters, if you plot the amount of gossip that is propagated by different people (if a particular quantum of gossip is propagated to two different people, we will count it twice), it is quite possible that it follows a power law distribution. This follows from the now well-known result that degree distribution in real-world social networks follows a power law distribution. On top of this, if you assume that some people are much more likely to propagate quanta of gossip they know to other people, and that such propensity for propagation is usually correlated with the person’s “degree” (number of connections), the above result is not hard to show.

The next question is about the way gossip actually propagates. The wife looks at the possibilities through two discrete models – hub-and-spoke and peer-to-peer. In the hub-and-spoke model, gossip is likely to spread along the spokes. Let us assume that the high-degree people are the hubs (intuitive). According to this model, these people collect gossip from spokes (low-degree people) and transmit it to others. In this model, gossip seldom propagates directly between two low-degree people.

At the other end is the peer-to-peer model where the likelihood of gossip spreading along an edge (connection between two people) is independent of the nature of the nodes at the end of the edge. In this kind of a model, gossip is equally likely to flow across any edge. However, if you overlay the (scale free/ power law) network structure over this model, then it will start appearing to be like a hub and spoke model!

In reality, neither of these models is strictly true since we also need to consider each person’s propensity to propagate gossip. There are some people who are extremely “sadhu” and politically correct, who think it is morally wrong to propagate unsubstantiated stories. They are sinks as far as any gossip is concerned. The amount of gossip that reaches them is also lower because their friends know that they’re not interested in either knowing or propagating it. On the other hand you have people (like I used to be) who have a higher propensity of propagating gossip. This also results in their receiving more gossip, and they end up propagating more.

So does gossip propagation follow the hub-and-spoke model or peer-to-peer model? The answer is “somewhere in between”, and a function of the correlation between the likelihood of a node propagating gossip and the degree of the node. If the two are uncorrelated (not unreasonable), then the flow will be closer to peer-to-peer (though degree distribution being a power law makes it appear as if it is hub-and-spoke). If there is very high positive correlation between likelihood of propagation and node degree, the model is very close to hub-and-spoke, since the likelihood of gossip flowing between low-degree nodes in such a case is very low, and thus most of the gossip flow happens through one of the hubs. And if the correlation between likelihood of propagation and node degree is negative, then the hubs pass on little of what they hear, and the flow is definitely peer-to-peer.

I plan to set up some simulations to actually study the above possibilities and further model how gossip flows!
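
A toy version of such a simulation, as a rough sketch only: it grows a preferential-attachment network (one common way of getting a power law degree distribution), calls the top 10% of nodes by degree the hubs, and gives hubs and everyone else different probabilities of passing gossip on. All the numbers here (network size, the 10% cutoff, the 0.9/0.5/0.1 probabilities) and the function names are assumptions made for illustration.

```python
import random

def preferential_attachment_graph(n, m, rng):
    """Grow a graph in which each new node links to m distinct existing nodes,
    chosen with probability proportional to their current degree."""
    neighbours = {i: set() for i in range(n)}
    endpoints = []  # every node appears once per edge end, so choice() is degree-weighted
    for i in range(m + 1):          # seed: a small clique of m + 1 nodes
        for j in range(i):
            neighbours[i].add(j)
            neighbours[j].add(i)
            endpoints += [i, j]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for t in targets:
            neighbours[new].add(t)
            neighbours[t].add(new)
            endpoints += [new, t]
    return neighbours

def hub_share(neighbours, mode, rng, trials=200, hub_fraction=0.1):
    """Share of successful tellings that have a hub at either end.

    mode sets the (assumed) propensities: 'positive' means hubs gossip more,
    'negative' means hubs gossip less, 'none' means everyone is alike.
    """
    by_degree = sorted(neighbours, key=lambda v: len(neighbours[v]), reverse=True)
    hubs = set(by_degree[: max(1, int(hub_fraction * len(by_degree)))])
    p_hub, p_other = {"positive": (0.9, 0.1),
                      "none": (0.5, 0.5),
                      "negative": (0.1, 0.9)}[mode]
    nodes = list(neighbours)
    via_hub = total = 0
    for _ in range(trials):
        start = rng.choice(nodes)   # one quantum of gossip starts at a random person
        informed = {start}
        frontier = [start]
        while frontier:
            sender = frontier.pop()
            p = p_hub if sender in hubs else p_other
            for receiver in neighbours[sender]:
                # count only tellings that reach someone who has not heard it yet
                if receiver not in informed and rng.random() < p:
                    informed.add(receiver)
                    frontier.append(receiver)
                    total += 1
                    if sender in hubs or receiver in hubs:
                        via_hub += 1
    return via_hub / total if total else 0.0

graph = preferential_attachment_graph(500, 2, random.Random(42))
for mode in ("positive", "none", "negative"):
    print(mode, round(hub_share(graph, mode, random.Random(7)), 2))
```

If the argument above holds, the share of tellings that touch a hub should be highest under positive correlation and lowest under negative correlation.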

Poverty and distributions

No, this post is not about the distribution of poverty. This is a rather technical post about probability distributions. Just that it has something to add to the poverty debate. And like the previous post, this is a departure from the normal RQ-type posts – there will be no graphs, no tables. Just theorizing.

So in the last week or two a lot of op-ed space in India has been consumed by what is described as the “poverty debate”. A recent survey by the National Sample Survey Organization (NSSO) has revealed that poverty levels in India have declined sharply in the last couple of years. And it only accelerates a sharp decline that started after a similar survey in 2004-05. Now, you have the “growthists” and the “distributionists”. The former claim that it is high economic growth in this time period that has led to the fall in poverty. The latter think it is due to redistributionist policies such as the National Rural Employment Guarantee Act (NREGA). Both sides have their merits. However, I’m not going to step into that debate now.

I ask a more fundamental question – how well can we trust the numbers that the NSSO has put out? My concern is this – that the poverty numbers have been gleaned out of a survey. I don’t have a problem with surveying – in fact surveying is a rather well-studied science, and I’m sure people at the NSSO are well-versed with it. My concern is that in this particular survey, the results may not have been properly extrapolated.

Most surveys rely on what are known as the “law of large numbers” and the “central limit theorem” and assume that the quantity being surveyed (people’s consumption expenditure as per this survey) follows a normal distribution. Except that we know that incomes (at least at the upper end of the scale) don’t follow a normal distribution. Instead, it has been shown that they follow what is called a power law distribution.

While I don’t doubt the general quality of scholarship at the NSSO, I want to ask if they have actually studied the real distribution of incomes and used the appropriate one, rather than using a normal distribution. It could be that incomes at the lower end of the scale actually do follow a normal distribution, in which case standard sampling techniques might be used. If not, however, I expect and hope that the NSSO has used a sampling and extrapolation technique appropriate to the distribution incomes actually follow.

Let me illustrate the issue with an extreme example. Let’s say that one of the names drawn as part of the NSSO’s “random sample” for Mumbai is one Mr. Mukesh D Ambani. Assume that there are 99 other persons in Mumbai who are drawn in the same sample, and each of them has an annual household income of Rs. 1 lakh. What will be the mean income of the group? Assuming Mr. Ambani earns Rs. 10 Crore a year (number pulled out of thin air), the mean income of the group of 100 will come out to be close to Rs. 11 lakh!
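
The arithmetic, spelt out in a few lines of Python, using the same made-up figures as above (Rs. 1 lakh for each of the 99 households and Rs. 10 crore for the outlier). The median is added only for comparison, as an assumption about what a more robust summary might look like.

```python
from statistics import median

LAKH = 100_000
CRORE = 100 * LAKH

# 99 ordinary households plus one outlier, as in the example above
incomes = [1 * LAKH] * 99 + [10 * CRORE]

mean_with = sum(incomes) / len(incomes)
mean_without = sum(incomes[:-1]) / (len(incomes) - 1)

print(f"mean with the outlier:    Rs. {mean_with / LAKH:.2f} lakh")      # 10.99
print(f"mean without the outlier: Rs. {mean_without / LAKH:.2f} lakh")   # 1.00
print(f"median with the outlier:  Rs. {median(incomes) / LAKH:.2f} lakh")  # 1.00
```

One name in a hundred moves the mean by a factor of eleven, while the median does not move at all.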

This is the problem with estimating incomes using surveys and standard extrapolation techniques. While the above example might have been extreme, even in smaller groups of population, there will be “local Mukesh Ambanis” – people whose incomes are much higher than their peer group. Inclusion or exclusion of such people in a standard survey can make a massive difference.
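
To get a feel for how much the luck of the draw matters, here is a small sketch that repeatedly draws samples of 100 from a heavy-tailed distribution and from a normal one, and compares how far apart the sample means land. The Pareto shape of 1.5 and the normal parameters are assumptions picked only so that both have a true mean of 3; they are not estimates of actual Indian incomes.

```python
import random

def sample_means(draw, sample_size, n_samples):
    """Mean of each of n_samples independent samples, using draw() for one value."""
    return [sum(draw() for _ in range(sample_size)) / sample_size
            for _ in range(n_samples)]

def spread(means):
    """Ratio of the largest sample mean to the smallest."""
    return max(means) / min(means)

rng = random.Random(2013)

# Heavy tail: Pareto with shape 1.5 (assumed), true mean 1.5 / 0.5 = 3
pareto_means = sample_means(lambda: rng.paretovariate(1.5), 100, 200)

# Well-behaved benchmark: normal with mean 3 and standard deviation 1 (assumed)
normal_means = sample_means(lambda: rng.gauss(3.0, 1.0), 100, 200)

print("largest / smallest sample mean, Pareto:", round(spread(pareto_means), 2))
print("largest / smallest sample mean, normal:", round(spread(normal_means), 2))
```

The normal sample means should all sit close to 3, while the Pareto ones swing depending on whether a “local Mukesh Ambani” happened to be drawn.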

I will end with an example and a request. I remember reading that any family in India that earns over Rs. 12 lakh a year (i.e. Rs. 1 lakh a month) is in the top 1% of all families in India! My family (wife and I) earns more than Rs. 12 lakh. But do we consider ourselves rich? By no means! Why? Because people who are richer than us are much richer than us! That is the problem with quantities that follow a power law distribution.

Now for the request. Can someone instruct me on the easiest way to get the raw data out of the NSSO? Thanks.