Why Your Distribution Might be Long-Tailed
Posted by Cameron Davidson-Pilon at
I really like the video below, which explains how a long-tailed distribution (often loosely called a powerlaw or fat-tailed distribution) can form naturally. In fact, I keep thinking about it and applying it in my own statistical thinking. Long-tailed distributions are incredibly common in the social sciences: for example, we encounter them in
- the wealth distribution: few people control most of the wealth.
- social networks: celebrities have thousands of times more followers than the median user.
- revenue generated by businesses: Amazon is larger than the next dozen internet retailers combined.
- book sales: most books have fewer than a thousand sales, but a handful of books have millions of sales.
I've enumerated only four, but there are many others. Below is a video explaining one reason why we see powerlaw distributions.
I really like this video because it gives a very intuitive reason why powerlaws can exist. It also drives home that a "winners become winners" scheme will create powerlaws and extreme inequalities.
It's tempting to go in reverse: given that you see a powerlaw, the distribution must have been generated by a "winners become winners" scheme. Unfortunately, this is not true: there are other schemes that can create powerlaws. But I would go as far as to say that most empirical powerlaws we see are generated by "winners become winners". The trick is to find out why winners become winners.
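To make the "winners become winners" idea concrete, here is a minimal sketch of one such scheme (a Simon-style preferential attachment process). It is not necessarily the exact mechanism shown in the video, and the names `simulate_winners_become_winners`, `steps`, and `p_new` are made up for this illustration. At each step a new unit of "wealth" either goes to a brand-new participant (with probability `p_new`) or to an existing participant chosen in proportion to what they already hold.

```python
import random
from collections import Counter


def simulate_winners_become_winners(steps=50_000, p_new=0.05, seed=0):
    """Simon-style process: each new unit goes to a new participant with
    probability p_new, otherwise to an existing participant chosen with
    probability proportional to their current holdings."""
    rng = random.Random(seed)
    # One list entry per unit of "wealth"; the value is the owner's id.
    # Picking a uniformly random entry therefore picks an owner in
    # proportion to how much they already own.
    owners = [0]
    n_participants = 1
    for _ in range(steps):
        if rng.random() < p_new:
            owners.append(n_participants)  # a newcomer starts with one unit
            n_participants += 1
        else:
            owners.append(rng.choice(owners))  # the rich get richer
    return Counter(owners)


counts = simulate_winners_become_winners()
sizes = sorted(counts.values(), reverse=True)
top_1_percent = sizes[: max(1, len(sizes) // 100)]
top_share = sum(top_1_percent) / sum(sizes)
median_size = sizes[len(sizes) // 2]

print(f"participants: {len(sizes)}")
print(f"share held by the top 1%: {top_share:.2f}")
print(f"median holding: {median_size}")
```

In runs like this the top 1% of participants end up holding a large share of the total while the median participant holds only a unit or two, which is the kind of inequality described above.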
Duration as a Powerlaw
The above gives a good generating scheme for how powerlaws arise for quantities (book sales, wealth, followers), but what about durations? There exist durations that are powerlawed: response times in server requests are powerlawed, and ages of technologies are powerlawed. The "winners become winners" generating scheme doesn't carry over well to the time domain. Is there an analogy we can use?
Update: Françoise P., in the comments, offers an interesting solution. Packet arrival times in networks often follow a powerlaw. There is another case, similar to this. Consider a random walk over the integers, starting at 0, with a 0.5 chance of moving up one and a 0.5 chance of moving down one at each step. The distribution of the duration between returns to 0 is powerlawed: for large n, the probability of a duration of length n is proportional to one over n to the power of 3/2 (returns can only happen at even n). Note that a probability proportional to one over 2 to the power of n would be a geometric decay, not a powerlaw.
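The random walk claim can be checked numerically. The sketch below uses the standard exact formula for the first return time of a simple symmetric walk, P(first return at step 2n) = C(2n, n) / ((2n - 1) 4^n), and a small simulation. The function names and the choices of `n_walks` and `cap` are just for this illustration.

```python
import math
import random


def first_return_prob(n):
    """Exact P(first return to 0 happens at step 2n) for the simple symmetric walk."""
    return math.comb(2 * n, n) / ((2 * n - 1) * 4 ** n)


def loglog_slope(n1, n2):
    """Slope of log P versus log n between two points; a powerlaw gives a constant slope."""
    return (math.log(first_return_prob(n2)) - math.log(first_return_prob(n1))) / (
        math.log(n2) - math.log(n1)
    )


def simulate_return_time(rng, cap):
    """Walk until the first return to 0; give up (return None) after `cap` steps."""
    position, steps = 0, 0
    while steps < cap:
        position += rng.choice((-1, 1))
        steps += 1
        if position == 0:
            return steps
    return None


slope = loglog_slope(1000, 2000)

rng = random.Random(0)
n_walks, cap = 20_000, 100
not_back = sum(simulate_return_time(rng, cap) is None for _ in range(n_walks))
frac_long = not_back / n_walks

print(f"log-log slope of the exact probabilities: {slope:.3f}")
print(f"fraction of walks not back within {cap} steps: {frac_long:.3f}")
```

The log-log slope sits close to -3/2, and a noticeable fraction of walks have still not returned after 100 steps. Under a geometric law that fraction would be essentially zero.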
- Tags: statistics
Comments
If you send a pulse of light on a solar cell and look at the electric current after you turn the light off, it drops like a power law. This time-domain power law is explained by the presence of defects in the semiconductor that trap and release the electrons on their way to the electrodes. By going through many trapping and untrapping cycles, some electrons reach the electrode much later, hence the long tail.
Perhaps those traps can be used as an analogy for modelling the hidden mechanism responsible for slowing down server responses?
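The trapping picture above can be turned into a toy simulation. This is only a sketch under assumptions that the comment does not state: trap depths are taken to be exponentially distributed (in units of thermal energy), the mean time spent in a trap grows exponentially with its depth, and each carrier is trapped a fixed number of times. The names `transit_delay`, `tail_ratio`, `n_traps`, and `alpha` are hypothetical.

```python
import math
import random


def transit_delay(rng, n_traps, alpha=None):
    """Total delay from n_traps trap-and-release cycles.

    alpha is the rate of the (assumed) exponential trap-depth distribution.
    With alpha=None every trap has depth 0, so every dwell time has mean 1.
    """
    total = 0.0
    for _ in range(n_traps):
        depth = rng.expovariate(alpha) if alpha is not None else 0.0
        mean_dwell = math.exp(depth)  # deeper traps hold on for longer
        total += rng.expovariate(1.0 / mean_dwell)
    return total


def tail_ratio(samples, q=0.99):
    """Ratio of the q-quantile to the median: a crude measure of tail weight."""
    ordered = sorted(samples)
    return ordered[int(q * len(ordered))] / ordered[len(ordered) // 2]


rng = random.Random(0)
n_samples, n_traps = 20_000, 10
varied_depths = [transit_delay(rng, n_traps, alpha=1.0) for _ in range(n_samples)]
equal_depths = [transit_delay(rng, n_traps) for _ in range(n_samples)]

print(f"99th percentile / median, varied trap depths: {tail_ratio(varied_depths):.1f}")
print(f"99th percentile / median, equal trap depths:  {tail_ratio(equal_depths):.1f}")
```

With equal trap depths the slowest carriers are only a few times slower than the typical one. With a spread of depths, a few very deep traps dominate and the delay distribution grows a long tail. The same contrast could be tried as a model of server requests that occasionally hit a slow resource.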