# Napkin Folding — statistics

## A real-life mistake I made about penalizer terms

Posted by **Cameron Davidson-Pilon** at

I made a very interesting mistake, and I wanted to share it with you because it says something enlightening about statistical learning in general. It concerns a penalizer term in maximum-likelihood estimation. Normally, one deals only with the penalizer coefficient, that is, one plays around with \(\lambda\) in an MLE optimization like: $$ \min_{\theta} -\ell(\theta) + \lambda ||\theta||_p^p $$ where \(\ell\) is the log-likelihood and \(||\cdot||_p\) is the \(p\)-norm. This family of problems is typically solved by calculus because both...
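To make the objective above concrete, here is a minimal sketch under assumptions that are not in the post: the data are modelled as \(N(\theta, 1)\), the penalty uses \(p = 2\), and the data values are made up. In that special case the minimizer has the closed form \(\hat\theta = \sum_i x_i / (n + 2\lambda)\), which the code compares against a simple numerical search.

```python
def penalized_nll(theta, xs, lam):
    """Negative log-likelihood of N(theta, 1) data (constants dropped) plus lam * theta^2."""
    nll = 0.5 * sum((x - theta) ** 2 for x in xs)
    return nll + lam * theta ** 2


def minimize_convex(f, lo, hi, iters=200):
    """Ternary search for the minimizer of a convex function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2


xs = [1.2, 0.7, 2.1, 1.5, 0.9]  # made-up data, for illustration only

for lam in (0.0, 1.0, 10.0):
    numeric = minimize_convex(lambda t: penalized_nll(t, xs, lam), -10, 10)
    closed_form = sum(xs) / (len(xs) + 2 * lam)  # valid only for this Gaussian, p = 2 case
    print(lam, round(numeric, 4), round(closed_form, 4))
```

Increasing `lam` pulls the estimate toward zero, which is the usual effect of playing around with \(\lambda\).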

## Distribution of the last value in a sum of Uniforms that exceeds 1

Posted by **Cameron Davidson-Pilon** at

While working on a problem, I derived an interesting result about sums of uniform random variables. I wanted to record it here so I don't forget it (I haven't solved the more general problem yet!). Here's the summary of the result: Let \(S_n = \sum_{i=1}^n U_i \) be the sum of \(n\) independent Uniform(0, 1) random variables. Let \(N\) be the index of the first time the sum exceeds 1 (so \(S_{N-1} < 1\) and \(S_{N} \ge 1\)). The distribution of \(U_N\)...
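The setup is easy to explore by simulation. The sketch below assumes the \(U_i\) are independent Uniform(0, 1) draws; the function names, trial count, and seed are my own choices. It records \(N\) and \(U_N\) for each run, so the empirical distribution of \(U_N\) can be inspected. As a sanity check, the average of \(N\) should land near \(e\), which is a classical result for this stopping time.

```python
import random


def first_passage(rng):
    """Add Uniform(0, 1) draws until the running sum reaches 1; return (N, U_N)."""
    total, n = 0.0, 0
    while True:
        u = rng.random()
        total += u
        n += 1
        if total >= 1.0:
            return n, u


def simulate(trials=200_000, seed=0):
    """Run many independent first-passage experiments."""
    rng = random.Random(seed)
    draws = [first_passage(rng) for _ in range(trials)]
    mean_n = sum(n for n, _ in draws) / trials
    last_values = [u for _, u in draws]
    return mean_n, last_values


mean_n, last_values = simulate()
print("average N:", mean_n)
print("average U_N:", sum(last_values) / len(last_values))
```

A histogram of `last_values` is a quick way to see that the last summand is not itself uniform: larger draws are more likely to be the one that pushes the sum past 1.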

## Poissonization of Multinomials

Posted by **Cameron Davidson-Pilon** at

### Introduction

I've seen some really interesting numerical solutions using a strategy called Poissonization, but Googling for it revealed very few resources (just some references in some textbooks that I don't have access to). So here it is: my notes and repository for Poissonization.

Theorem: Let \(N \sim \text{Poi}(\lambda)\) and suppose that, conditional on \(N=n\), \((X_1, X_2, ..., X_k) \sim \text{Multi}(n, p_1, p_2, ..., p_k)\). Then, marginally, \(X_1, X_2, ..., X_k\) are independent Poisson, with \(X_i \sim \text{Poi}(p_i \lambda)\). [1] The proof is as follows. By...
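The theorem can be checked empirically. The following is a rough simulation sketch, not a proof: the values of \(\lambda\) and \(p_i\), the helper names, and the use of Knuth's multiplication method for Poisson sampling are all my own assumptions. If the theorem holds, each cell count should have mean and variance close to \(p_i \lambda\), and the sample covariance between cells should be close to zero.

```python
import math
import random


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small lam."""
    threshold = math.exp(-lam)
    k, running = 0, rng.random()
    while running > threshold:
        k += 1
        running *= rng.random()
    return k


def sample_counts(lam, probs, rng):
    """Draw N ~ Poi(lam), then allocate N items to cells with probabilities probs."""
    n = sample_poisson(lam, rng)
    counts = [0] * len(probs)
    for cell in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[cell] += 1
    return counts


def summarize(lam=10.0, probs=(0.2, 0.3, 0.5), trials=20_000, seed=1):
    """Return per-cell means, per-cell variances, and the covariance of cells 0 and 1."""
    rng = random.Random(seed)
    rows = [sample_counts(lam, probs, rng) for _ in range(trials)]
    k = len(probs)
    means = [sum(r[i] for r in rows) / trials for i in range(k)]
    variances = [sum((r[i] - means[i]) ** 2 for r in rows) / trials for i in range(k)]
    cov01 = sum((r[0] - means[0]) * (r[1] - means[1]) for r in rows) / trials
    return means, variances, cov01


means, variances, cov01 = summarize()
print("means:", means)          # compare with p_i * lambda
print("variances:", variances)  # Poisson: variance equals mean
print("cov(X_1, X_2):", cov01)  # independence implies this is near 0
```

Zero covariance alone does not establish independence, so this only rules out gross violations. For contrast, fixing \(N = n\) instead of drawing it from a Poisson gives the familiar negative multinomial covariance \(-n p_1 p_2\).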

## Bayesian Methods for Hackers release!

Posted by **Cameron Davidson-Pilon** at

Finally, after a few years of writing and debugging, I'm proud to announce that the print copy of Bayesian Methods for Hackers is released! Compared to the online version, it has updated content, including a brand new chapter on A/B testing. You can purchase it on Amazon today!

## How can I use non-constructive proofs in data analysis?

Posted by **Cameron Davidson-Pilon** at

In mathematics, there are two classes of proof techniques: constructive and non-constructive. Constructive proofs demonstrate how to build the object required. Its construction proves its existence, hence you are done. An example of this is proving that there are infinitely many prime numbers using Euclid's argument: to find a new prime number, you multiply together all the prime numbers seen thus far and add 1; the result is either prime itself or has a prime factor not yet on your list. On the other hand, a non-constructive proof does not detail how to build the object, just states that it must...
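Euclid's construction is concrete enough to run. Here is a small sketch (the function names are mine): given a list of primes, it multiplies them, adds 1, and returns the smallest prime factor of the result. That factor cannot be in the original list, because dividing the product plus 1 by any listed prime leaves remainder 1.

```python
from math import prod


def smallest_prime_factor(n):
    """Trial division; returns n itself when n is prime."""
    d = 2
    while d * d <= n:
        if n % d == 0:
            return d
        d += 1
    return n


def new_prime_from(primes):
    """Euclid's step: a prime that is guaranteed not to be in `primes`."""
    candidate = prod(primes) + 1
    return smallest_prime_factor(candidate)


# Start from a single prime and keep constructing new ones.
primes = [2]
for _ in range(6):
    primes.append(new_prime_from(primes))
print(primes)
```

Note that the product plus 1 is not always prime itself (for example, 2 · 3 · 5 · 7 · 11 · 13 + 1 = 30031 = 59 · 509), which is why the sketch takes a prime factor rather than the number directly.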

## Bayesian M&M Problem in PyMC 2

Posted by **Cameron Davidson-Pilon** at

This Bayesian problem is from Allen Downey's Think Bayes book. I'll quote the problem here: M&M’s are small candy-coated chocolates that come in a variety of colors. Mars, Inc., which makes M&M’s, changes the mixture of colors from time to time. In 1995, they introduced blue M&M’s. Before then, the color mix in a bag of plain M&M’s was 30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan. Afterward it was 24% Blue, 20% Green, 16%...

## What The Name?!

Posted by **Cameron Davidson-Pilon** at

Over the holidays, Kylea Parker and I put together our first ever infographic! With her design background and my stats background, we set out to do infographics right: correct statistics and beautiful communication through design. I believe we achieved that. The data analysis was done using demographica.

## Dawkins on Saying "statistically, ... "

Posted by **Cameron Davidson-Pilon** at

Richard Dawkins, in his early book The Extended Phenotype, describes what he means when he says "statistically, X occurs". His original motivation was addressing a comment about gender, but it applies more generally: If, then, it were true that the possession of a Y chromosome had a causal influence on, say, musical ability or fondness for knitting, what would this mean? It would mean that, in some specified population and in some specified environment, an observer in possession of information...

## [Video] Presentation on Lifelines - Survival Analysis in Python, Sept. 23, 2014

Posted by **Cameron Davidson-Pilon** at

I gave this talk on Lifelines, my project on survival analysis in Python, to the Montreal Python Meetup. It's a pretty good introduction to survival analysis, and how to use Lifelines. Enjoy!

## Why Your Distribution Might be Long-Tailed

Posted by **Cameron Davidson-Pilon** at

I really like the video below, which explains how a long-tailed distribution (also called a power-law or fat-tailed distribution) can form naturally. In fact, I keep thinking about it and applying it to some statistical thinking. Long-tailed distributions are incredibly common in the social sciences. For example, we encounter them in the wealth distribution: few people control most of the wealth; social networks: celebrities have thousands of times more followers than the median user; revenue generated by businesses: Amazon is larger than...

## The Class Imbalance Problem in A/B Testing

Posted by **Cameron Davidson-Pilon** at

### Introduction

If you have been following this blog, you'll know that I employ Bayesian A/B testing for conversion tests (see this screencast for how it works). One of the strongest reasons for this is the interpretability of the analogous "p-value", which I call Confidence, defined as the probability that the conversion rate of A is greater than that of B, $$\text{confidence} = P( C_A > C_B ) $$ Really, this is what the experimenter wants - an answer to: "what are...
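One common way to estimate this quantity is Monte Carlo sampling from the posterior of each conversion rate. The sketch below is only illustrative and rests on assumptions not stated above: independent uniform Beta(1, 1) priors on \(C_A\) and \(C_B\), binomial conversions, and hypothetical visitor counts.

```python
import random


def prob_a_beats_b(conv_a, n_a, conv_b, n_b, draws=100_000, seed=0):
    """Monte Carlo estimate of P(C_A > C_B) under independent Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # With a Beta(1, 1) prior, the posterior is Beta(1 + successes, 1 + failures).
        c_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        c_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += c_a > c_b
    return wins / draws


# Hypothetical test: 120/1000 conversions for A versus 100/1000 for B.
print(prob_a_beats_b(120, 1000, 100, 1000))
```

The return value reads directly as "the probability that A converts better than B", which is the interpretability argument made above.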

## Using Census Data to Find Hot First Names

Posted by **Cameron Davidson-Pilon** at

We explore some cool data on first names and introduce a library for making this data available. We then use k-means to find the most trending names right now, and introduce some ideas on age inference of users.

### Freakonomics, the original Data Science book

One of the first data science books, though it wasn't labelled that at the time, was the excellent book "Freakonomics" (2005). The authors were the first to publicise using data to solve large problems, or to...

## The Binary Problem and The Continuous Problem in A/B testing

Posted by **Cameron Davidson-Pilon** at

### Introduction

I feel like there is a misconception in performing A/B tests. I've seen blogs, articles, etc. that show off the result of an A/B test, something like "converted X% better". But this is not what the A/B test was actually measuring: an A/B test is measuring "which group is better" (the binary problem), not "how much better" (the continuous problem). In practice, here's what happens: the tester waits until the A/B test is over (hence solving the binary problem),...
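The gap between the two problems can be illustrated numerically. This is a sketch under my own assumptions (independent Beta(1, 1) priors, binomial conversions, hypothetical counts), not the analysis from the post: it estimates both the probability that A is better and a 95% posterior interval for the relative lift \((C_A - C_B)/C_B\).

```python
import random


def lift_summary(conv_a, n_a, conv_b, n_b, draws=50_000, seed=0):
    """Return (P(C_A > C_B), lower, upper) where [lower, upper] is a 95% interval for the relative lift."""
    rng = random.Random(seed)
    lifts = []
    for _ in range(draws):
        c_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        c_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        lifts.append((c_a - c_b) / c_b)
    lifts.sort()
    prob_a_better = sum(lift > 0 for lift in lifts) / draws
    lower = lifts[int(0.025 * draws)]
    upper = lifts[int(0.975 * draws)]
    return prob_a_better, lower, upper


# Hypothetical test: 120/1000 conversions for A versus 100/1000 for B.
prob, lower, upper = lift_summary(120, 1000, 100, 1000)
print("P(A better):", prob)
print("95% interval for relative lift:", lower, upper)
```

With counts like these, the answer to "which group is better" can look fairly settled while the interval for "how much better" remains wide, so quoting the observed 20% lift as the result overstates what the test measured.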