
## The Economics of 90-9-1: The Gini Coefficient (with Cross Sectional Analyses)

Dr. Michael Wu, Ph.D. is Lithium's Principal Scientist of Analytics, digging into the complex dynamics of social interaction and online communities.

He's a regular blogger on the Lithosphere and previously wrote in the Analytic Science blog.

Last week, I discussed the Lorenz curve and how to use it to quantify precisely how much content is contributed by various portions of the community population. This article builds on my previous posts, so I would recommend reading through those earlier articles before diving into this one.

Last week I illustrated the utility of the Lorenz curve with data from Lithosphere. The question now is: what does the Lorenz curve look like for other communities?

If there is a community where everyone participates equally (for example, everyone posts exactly 10 messages), then the Lorenz curve is a straight line (fig 1: Perfect Equality). As participation deviates from perfect equality, the curve bows downward (fig 2: Unequal). The greater the inequality, the further the curve is pushed down and to the right (fig 3: More Unequal). In the extreme case, where one person produces all the content and everyone else just lurks, the curve turns into a rectangular corner (a.k.a. the delta function) as shown below (fig 4: Total Inequality).

So the shape of the Lorenz curve tells us how unequal the participation is in any community.
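Here is a minimal sketch of how such a curve could be built from per-user post counts. The function name `lorenz_curve` and the sample counts are made up for illustration, and the sketch assumes a non-empty list with at least one post; it is not the code behind the analysis in this post.

```python
def lorenz_curve(post_counts):
    """Return (population_share, content_share) points of a Lorenz curve.

    Participants are sorted from least to most active, so the curve falls
    below the diagonal whenever participation is unequal.
    """
    counts = sorted(post_counts)
    total = sum(counts)
    n = len(counts)
    points = [(0.0, 0.0)]
    running = 0
    for i, count in enumerate(counts, start=1):
        running += count
        points.append((i / n, running / total))
    return points


# Hypothetical community where everyone posts exactly 10 messages:
# every point lies on the diagonal (perfect equality).
equal = lorenz_curve([10, 10, 10, 10])

# Hypothetical community where one person writes everything:
# the curve stays at zero until the very last participant.
one_author = lorenz_curve([0, 0, 0, 40])
```

Plotting `equal` gives the straight diagonal, while `one_author` hugs the bottom axis and jumps to 1 only at the last point.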

### The Gini Coefficient

Although we can visually examine the Lorenz curve and get a good sense of how unequal the participation is in a community, we still haven't numerically quantified the degree of inequality. To do this, the Italian statistician Corrado Gini created the Gini coefficient, which, by definition, is the area between the Lorenz curve (the red line in the above figure) and the line of Perfect Equality (the diagonal blue dotted line). This area is also normalized, so that when there is total inequality, the area between the red rectangular corner and the dotted blue line of perfect equality is 1. So the Gini coefficient is just the area of the yellow patches in the above figures. The Gini coefficient is sometimes multiplied by 100 to rescale it to an easily understandable score. This rescaled version is also known as the Gini index.
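The definition can also be written compactly. The notation here is my own shorthand, not something taken from the figures: let $L(x)$ be the Lorenz curve (the fraction of content produced by the least active fraction $x$ of participants), $A$ the area between the diagonal and $L$, and $B$ the area under $L$. Since $A + B = 1/2$, normalizing simply means dividing by $1/2$:

```latex
G = \frac{A}{A + B} = 2A = 1 - 2\int_0^1 L(x)\,dx
```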

When there is perfect equality, the Gini coefficient would be zero (fig 1: G1=0). As the participation level deviates from perfect equality, the Gini coefficient will increase (fig 2: G2>0). The greater the inequality, the larger the numerical value of the Gini coefficient (fig 3: G3>G2). In the extreme case of total inequality the Gini coefficient would be one (fig 4: G4=1).
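These limiting cases are easy to check numerically. The sketch below estimates the area under the Lorenz curve with the trapezoid rule and returns one minus twice that area. The function name `gini` and the sample communities are assumptions for illustration. Note that with a finite number of participants n, total inequality gives (n-1)/n rather than exactly 1, which approaches 1 as the community grows.

```python
def gini(post_counts):
    """Gini coefficient as 1 - 2 * (area under the Lorenz curve)."""
    counts = sorted(post_counts)
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        return 0.0  # assumption: treat an empty or silent community as equal
    area_under = 0.0
    previous_share = 0.0
    running = 0
    for count in counts:
        running += count
        share = running / total
        # Trapezoid between consecutive Lorenz points, each 1/n wide.
        area_under += (previous_share + share) / (2 * n)
        previous_share = share
    return 1.0 - 2.0 * area_under


g_equal = gini([10] * 4)          # perfect equality
g_mixed = gini([1, 2, 3, 4])      # moderately unequal
g_total = gini([0] * 99 + [100])  # one of 100 people writes everything
```

Here `g_equal` is 0, `g_mixed` is 0.25, and `g_total` is 0.99, following the same ordering as the four figures.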

Now I can calculate the Gini coefficient for Lithosphere using the same data I used last week (lurkers excluded for simplicity). The Gini coefficient for Lithosphere's cumulative post activity as of Feb 28, 2010 is Gc=0.79. I can do the same for all communities in our data warehouse and compute the mean level of participation inequality. The mean Gini coefficient for all our communities turned out to be 0.64 with SD=0.11. So participation in Lithosphere is more unequal than in our average community (remember we excluded the lurkers for now).
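One rough way to put that comparison in numbers is to express Lithosphere's coefficient in standard deviations from the cross-community mean. This is my framing, not a statistic computed in the original analysis.

```latex
z = \frac{G_c - \bar{G}}{\mathrm{SD}} = \frac{0.79 - 0.64}{0.11} \approx 1.36
```

That is, Lithosphere sits roughly 1.4 standard deviations above the average community.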

### Slicing and Dicing the Data

Having the Gini coefficient for all the communities in our data warehouse, I can now compare and contrast the data across industry, community type (support, marketing, or innovation), and audience (B2C, B2B, or internal).

1. I used the 161 communities in our data warehouse that have industry data for this analysis.
2. However, 4 of them do not have audience data, so the sample size, N, for the audience segmentation sums to only 157.
3. The sample size, N, for the community type segmentation sums to 175 because a few communities have multiple types (e.g. a community may serve both support and marketing functions).
4. The HiTech segment includes the high-tech product and software industries.
5. The Consumer segment includes the consumer product, retail, and non-high-tech product industries.
6. The Entertainment segment includes the media, entertainment, and gaming industries.
7. The Financial segment includes the financial, insurance, and health care industries.

We can see clearly that the mean Gini coefficient for marketing communities (0.71) is higher than those of support and innovation communities. Likewise, participation in B2C communities is more unequal than in B2B and internal communities, albeit by a smaller margin (the mean Gini coefficient for B2C communities is only 0.67, slightly above the overall mean of 0.64).

If we segment the communities with a coarse binning by industry, we can see that the degree of participation inequality is not significantly different across industries, except for communities in the entertainment industry. Most of the industry-average Gini coefficients are very close to the overall community mean of 0.64, whereas the average Gini coefficient for the entertainment industry is 0.75 (about 1 standard deviation above the mean).

### What Does This Inequality Mean?

So what does inequality of participation mean to you? And what does it really mean for a Gini coefficient to be 0.64 as opposed to 0.75?

To address these questions, I will repeat the analysis from my earlier post on these segmented communities. So here is the data you've been asking for, at least some of it (I don't want to turn this blog into a full-fledged academic paper!).

Using the Lorenz curve, I can easily compute the fraction of content produced by the top 10% of the participants (lurkers excluded again), which should correspond to the 1% of hyper-contributors in the 90-9-1 rule.

These data show a strong correlation with the mean Gini coefficient data above. So greater participation inequality (a larger Gini coefficient) means that the hyper-contributors are more prolific. This makes intuitive sense because you can think of inequality as the difference between the most prolific and the least prolific users. Because the least prolific users are always the lurkers with zero participation, a bigger difference between the two extremes implies that the top users must be more productive.

Since marketing communities, B2C communities, and communities in the entertainment industry have higher Gini coefficients, their hyper-contributors (defined here as the top 10% of participants) produce a larger share of the content than in other communities: 64%, 60%, and 69% respectively, compared to the mean of 55%.
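To make the "top 10%" measure concrete, here is a minimal sketch that reads the share straight from sorted post counts, which is equivalent to reading one point off the Lorenz curve. The function name `top_share` and the 20-person community are hypothetical, not data from our warehouse.

```python
def top_share(post_counts, top_fraction=0.10):
    """Fraction of all posts written by the most active `top_fraction` of participants."""
    counts = sorted(post_counts, reverse=True)
    # At least one participant is always counted as "top".
    k = max(1, round(len(counts) * top_fraction))
    return sum(counts[:k]) / sum(counts)


# Hypothetical community of 20 participants (lurkers excluded):
# two heavy posters and eighteen occasional ones.
community = [60, 30] + [5] * 18
share = top_share(community)  # top 10% = the 2 heaviest posters
```

In this made-up community the top two participants wrote 90 of the 180 posts, so `share` is 0.5.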

Once I have computed the Lorenz curve, turning the problem around is trivial. If we define "most of the community content" to be at least 50% (see The 90-9-1 Rule in Reality), then the Lorenz curve gives us an estimate of the hyper-contributor population as the fraction of participants that is required to produce at least 50% of the total content.

This data is anti-correlated with the Gini coefficient data in the previous section. So greater participation inequality means that the percentage of hyper-contributors will be smaller. This is consistent with the observation we made earlier that the hyper-contributors are more prolific when there is greater participation inequality. As they are more prolific, naturally fewer of them are needed to contribute the same amount, in this case 50% of the total content. I am not going to recite the data points here; just look at the chart and ask me if you have any questions.
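The inverse calculation can be sketched the same way: walk down the participants from most to least active until their combined posts reach the target share. The function name `fraction_needed` and both sample communities are assumptions for illustration.

```python
def fraction_needed(post_counts, content_share=0.5):
    """Smallest fraction of participants whose posts cover `content_share` of all content."""
    counts = sorted(post_counts, reverse=True)
    target = content_share * sum(counts)
    running = 0
    for k, count in enumerate(counts, start=1):
        running += count
        if running >= target:
            return k / len(counts)
    return 1.0


# Hypothetical unequal community: two heavy posters, eighteen occasional ones.
unequal = fraction_needed([60, 30] + [5] * 18)

# Hypothetical perfectly equal community of the same size.
equal = fraction_needed([10] * 20)
```

The unequal community needs only 10% of its participants to reach half the content, while the equal one needs 50%, which is the anti-correlation described above.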

### Why Use the Gini Coefficient?

Since the Gini coefficient is highly correlated with the fractional contribution of the top participants, you might wonder why bother with the Gini coefficient at all? The answer is its elegant simplicity and accuracy.

I have deliberately left out the lurkers in all our discussion, so there is only one number we have to track (either the fractional contribution or the proportion of hyper-contributors). If I were to put the lurkers back into the picture, we would need another number that quantifies the ratio between lurkers and participants. If I want greater accuracy with finer granularity than just lurkers, occasional contributors, and hyper-contributors (say, I also want to know about a group called the moderate contributors), then we will need even more numbers.

In contrast, because the Lorenz curve tracks the data for all possible participation levels, it has all the accuracy we will ever need. Despite that, the Gini coefficient is always a single number. Let me illustrate the utility of this with a hypothetical example.

Suppose you encounter three communities: one follows the 90:9:1 rule precisely, the second follows a rule that is numerically more like 94:4:2, and the third follows an 88:10:2 rule. Question: which community has the greatest level of participation inequality? Even if these numbers are accurate, it is not obvious how to rank them. With the Gini coefficient, a single number accurately quantifies the participation inequality, so we can easily identify the community with the largest Gini coefficient and thus the most unequal level of participation. Even if you do not care about participation inequality itself, you might want to know which community has the most prolific hyper-contributors, or the relative proportion of its hyper-contributor population.

With a simple yet accurate statistic like the Gini coefficient, we can compute it for a window of activity at different times and watch how the participation inequality changes as the community grows. We can also build accurate models that have strong predictive power. The possibilities become endless! When you can rigorously quantify something, that's when you turn it into a science. That is when you can gain quantitative and predictive insights. And that is when all the fun begins - at least for me.
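As a rough sketch of the windowed idea, the snippet below groups posts by a window label and computes one Gini coefficient per window. Everything here is hypothetical: the `(window, author)` input format, the monthly labels, and the function names. It uses the sorted-rank form of the Gini coefficient, which is algebraically the same as one minus twice the trapezoid area under the Lorenz curve, and it only counts people who posted in a window, so lurkers are excluded as in the rest of this post.

```python
from collections import Counter, defaultdict


def gini(post_counts):
    """Gini coefficient via the sorted-rank formula."""
    counts = sorted(post_counts)
    n = len(counts)
    total = sum(counts)
    if n == 0 or total == 0:
        return 0.0  # assumption: no activity is treated as no inequality
    weighted = sum(rank * count for rank, count in enumerate(counts, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def gini_by_window(posts):
    """posts: iterable of (window_label, author) pairs, one pair per post."""
    per_window = defaultdict(Counter)
    for window, author in posts:
        per_window[window][author] += 1
    return {w: gini(list(per_window[w].values())) for w in sorted(per_window)}


# Made-up activity log: three equal posters in January,
# then one author dominating in February.
posts = [
    ("2010-01", "a"), ("2010-01", "b"), ("2010-01", "c"),
    ("2010-02", "a"), ("2010-02", "a"), ("2010-02", "a"), ("2010-02", "b"),
]
trend = gini_by_window(posts)
```

For this toy log the coefficient rises from 0 in January to 0.25 in February.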

As always, please let me know if you have questions or thoughts. This is a long blog with a lot of data, so we will take a short break from the 90-9-1 data mantra and come back to it later. Next time let's explore the science of influence.

Sorry if I missed this somewhere, but how do you define different types of communities? Could you provide some examples?

Thanks!

D

Hello Diana,

Thank you for the comment.

The community types (support, marketing, and innovation) are determined by our clients. If Client X comes to us and says they want to launch a support community, then that community will be a support community. If they want a community that serves both support and marketing functions, then that community will be both a support and a marketing type.

AT&T, BestBuy, and Lenovo are all Support communities. Since their end users are mostly customers, they are B2C communities.

BT Business and PitneyBowes are also Support communities, but since their end users are other business enterprises, they are B2B communities.

Barnes&Noble and Playstation are both Marketing/Sales communities, and they are also B2C.

Hope this helps.

Michael,

This was really fascinating stuff, and as we watch the number of solutions and TKB articles in our community grow, I've come back to consider this.

In your model, are you considering all the content to be of the same value?

Have you ever plotted solutions or kudos or some other qualitative measure against this? Would we tend to find the curve even steeper? Purely academic, I suppose, since the super contributors are going to tend to be heavily represented in those measures by the very nature of their behaviors in the community.

Mark

Hello Mark,

Thanks for revisiting this post. My blogs are always open to discussion.

Currently, all posts are treated the same. I've not plotted solutions or kudos against the Lorenz curve, but I have plotted the Lorenz curve with posts weighted by kudos and accepted solutions. What I did was simply add 1 for every kudo that a post received, and add 2 if the post is an accepted solution.

So if I have written 10 posts and received 5 kudos on them, that is the same as if I had posted 15 posts. (It doesn't really matter how the kudos are distributed, because I simply add them. In fact, that is why I chose to add them instead of using them as a multiplicative factor.) And if 2 of the 10 posts are solutions, then essentially I get 19 points.
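In code, the additive weighting is a one-liner. The function name `weighted_post_count` is just an illustrative label for the rule described here.

```python
def weighted_post_count(posts, kudos, accepted_solutions):
    """Each post counts 1, each kudo adds 1, each accepted solution adds 2."""
    return posts + kudos + 2 * accepted_solutions


# The example above: 10 posts, 5 kudos, 2 accepted solutions.
score = weighted_post_count(posts=10, kudos=5, accepted_solutions=2)
```

Here `score` is 19. These weighted counts would then replace the raw post counts when building the Lorenz curve.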

If I construct the Lorenz curve based on the kudo/solution-augmented post counts, then the curve does get steeper. In fact, much steeper. The hyper-contributors get even more weight, making the curve almost hard to read. That is one of the reasons I didn't do that in my current model (currently all posts are treated equally). But this matches our intuition. Remember, the Lorenz curve and Gini coefficient were developed to quantify economic and income inequality. If we use one type of asset to quantify wealth, the super rich get a lot; if we include other assets, they seem to get even richer and take a bigger slice of the pie. This is the same result I get with community data. So social equity behaves very much like real equity.

So your conjecture is totally right.

Thanks for commenting.