TLDR: upgrade lifelines for lots of improvements
pip install lifelines==0.20.0
During my time off, I’ve spent a lot of time improving my side projects so I’m at least kinda proud of them. I think lifelines, my survival analysis library, is in that spot. I’m actually kinda proud of it now.
A lot has changed in lifelines in the past few months, and in this post I want to mention some of the biggest additions and the stories behind them.
The Cox proportional hazard model is the workhorse of survival analysis. Almost all papers, unless good reason not to, use the Cox model. This was one of the first regression models added to lifelines, but it has always been too slow. It’s implemented in Numpy, but there was a tricky for
loop still in Python. I had ideas on how to turn that loop into a vectorized Numpy operation of matrix products, but there would have been an intermediate variable that created a d x d x p
tensor, where d
is the number of independent covariates and p
is the size of some subset of subjects (at worst, p could be equal to the number of subjects). This will quickly explode the amount of memory required and hence performance would degrade.
One night, on Twitter, I noticed some posts about how to use einsum
in PyTorch and Numpy. I had previously heard of it, but didn’t think it was something that was implemented in Numpy nor did I think it was something I could use. Turns out, einsum
is a way to do matrix multiplication without intermediate variables! How? Since a product of matrices is a just multiplications and sums, one can declaratively define what the end product should look like and the internals of einsum
will compute the intermediate operations at the C layer (I’m simplifying and my own understanding of this is shaky). Suffice to say, after racking my brain and lots of trial and error with einsum
, I could replace the tricky Python for
loop with einsum
! This resulted in a 3x performance increase. However, what I've gained in performance, I've lost in some readability of my code. I’ve put some references to einsum
in the bottom of this article.
The second significant improvement to the performance of the Cox model is using a meta-algorithm to select the fastest algorithm. There are two ways to compute the likelihood, and the performance of each is highly dependent on a few characteristics of the dataset, notably how many ties are in the dataset. After running some tests, I noticed that the delineation of when one was faster than the other was not clear, so a heuristic like if num_ties > 10:
would not be very effective. What I did instead was to generate hundreds of artificial dataset of varying ties and varying size, and measured the timing of both algorithms. I then fit a linear model to the ratio of the times, conditioned on the ties and size (and their interaction). It was a surprisingly good fit! So now in lifelines, at runtime, I compute some statistics about the incoming dataset, plug these values into a fitted linear model, and the result is the predicted ratio of timings between the two algorithms. I then choose which algorithm would be faster and continue on. The prediction is super fast (it’s a linear model after all), so there are no performance hits there. With this meta-algorithm, the lifelines Cox implementation is up to 3x faster for some datasets. I wrote up a full summary of the idea in a previous blog post [2].
Overall, the Cox model is now 10x faster than it was a few months ago. (Also, I only mention it here, but the Aalen Additive model is like 50x times faster, but most of those speed improvements where replacing Pandas with NumPy in critical points.)
One large gap in lifelines was checking the proportional hazards assumption, which is critical for any kind of inference-focused modeling (it matters less for prediction tasks). The author of the popular R survival library, Terry Therneau, has made massive contributions to survival analysis techniques, including a statistical test for non-proportionality. This test relies on an important residual of the Cox regression. While implementing this statistical test in lifelines, I realized there was a more general solution for handle all residuals, so I added functionality to compute the most common residuals.
These additions enabled a new, very user friendly function, check_assumptions
, which prints out potential proportionality violations in human readable format and offers advice on how to fix it. I also introduced residuals plots:
I think too many people are focused on deep learning. There, I said it, and probably you agree. However, some really cool technologies are falling out of that area that others can use. One of them is libraries that implement automatic differentiation, aka autodiff. This is like the holy grail for computational statisticians. Let me quickly explain why: given an arbitrary numerical function, you can automatically compute its exact gradient at any (valid) point. No rounding errors. No restrictions on the function. Just gradients. To quote directly from [1]:
/beginquote
Q: What’s the difference between autodiff and symbolic diff?
R: They are totally different. The biggest difference is that autodiff can differentiate algorithms, not just expressions. Consider the following code:
function f(x)
y = x;
for i=1…100
y = sin(x+y);
return y
Automatic differentiation can differentiate that, easily, in the same time as the original code. Symbolic differentiation would lead to a huge expression that would take much more time to compute.
Q: What about non-differentiable functions?
R: No problem, as long as the function is differentiable at the place you try to compute the gradient.
/endquote
So why am I so excited about this? I suffered through a week of frustrations and headaches trying to implement a log-normal survival model by hand. You can see my frustration here [3]. Also, my second-derivative calculations were abysmal, which meant we would be computing unreliable confidence intervals. The whole thing made me depressed. I was pointed to the Python library autograd [4], and after some wrestling, it was like a beam of heaven shown down on me. Computing gradients with autograd is easy, computing second-derivatives is easy, and performance is near identical. I imagined future generalizations and abstractions, and this has radically simplified my code base. One cool idea is that since we know the second derivative exactly, we can compute the variance matrix of the fitted parameters exactly, and we can use the delta method (and autograd) to compute variances of arbitrary functions of those fitted parameters. Here are two worked examples of the cool things you can do in lifelines now:
Autograd also enabled lifelines to implement accelerated failure time models, so users now have three new regression models to play with. I’ve been so happy with autograd that I’ve converted parts of my other Python library, lifetimes, to use it as well.
I’ve given the lifelines docs a serious facelift, probably doubled the amount of content, and edited large parts of it. Overall, I am much happier with the docs now. One addition I made was adding tracking of visitors’ searches on the docs site. This gives me some idea of where users might be confused, or where current docs structure is insufficient.
It's been a productive few months, and I think lifelines is in a good state. More of my attention now is on lifetimes (a new version was just released, by the way) and some other topics. Hope you enjoy lifelines!
[3] https://github.com/CamDavidsonPilon/lifelines/issues/622
]]>CoxTimeVaryingFitter
, is used for time-varying datasets. Time-varying datasets require a more complicated algorithm, on that works by iterating over all unique times and "pulling out" the relevant rows associated with that time. It's a slow algorithm, as it requires lots of Python/Numpy indexing, which gets worse as the dataset size grows. Call this algorithm batch. The second implemented class, CoxPHFitter
, is used for static datasets. CoxPHFitter
uses a simpler algorithm that iterates over every row once only, and requires minimal indexing. Call this algorithm single¹.
The strange behaviour I noticed was that, for my benchmark static dataset, the more-complicated CoxTimeVaryingFitter
was actually faster than my simpler CoxPHFitter
model. Like almost twice as fast.
The thought occurred to me that iterating over all unique times has an advantage versus iterating over all rows when the cardinality of times is small relative to the number of rows. That is, when there are lots of ties in the dataset, our batch algorithm should perform faster. In one extreme limit, when there is no variation in times, the batch algorithm will perform a single loop and finish. However, in the other extreme limit where all times are unique, our batch algorithm does expensive indexing too often.
This means that the algorithms will perform differently on different datasets (however the results should still be identical). And the magnitude of the performances is like a factor of 2, if not more. But given a dataset at runtime, how can I know which algorithm to choose? It would be unwise to let the untrained user decide - they should be abstracted away from this decision.
As a first step, I wanted to know how often the batch algorithm outperformed the single algorithm. To do this, I needed to create datasets with different sizes and varying fractions of tied times. For the latter characteristic, I defined the (very naive) statistic as:
$$\text{n unique times} = \text{frac}\; \cdot \; \text{n rows} $$
Once dataset creation was done, I generated many combinations and timed the performance of both the batch algorithm (now natively ported to CoxPHFitter
) and the single algorithm. A sample of the output is below (the time units are seconds).
N | frac | batch | single |
432 | 0.010 | 0.175 | 0.249 |
432 | 0.099 | 0.139 | 0.189 |
432 | 0.189 | 0.187 | 0.198 |
... | ... | ... | ... |
So, for 432 rows, and a very high number of ties (i.e. low count of unique times) we see that the batch algorithm performance of 0.18 seconds vs 0.25 seconds for the single algorithm. Since I'm only comparing two algorithms and I'm interested in the faster one, the ratio of batch performance to single performance is just as meaningful. Here's all the raw data.
Looking at the raw data, it's not clear what the relationship is between N, frac, and ratio. Let's plot the data instead:
What we can see is that the ratio variable increases almost linearly with N and frac. Almost linearly. At this point, one idea is to compute both the statistics (N, frac) for a given dataset (at runtime, it's cheap), and predict what its ratio is going to be. If that value is above 1, use single algorithm, else use batch algorithm.
We can run a simple linear regression against the dataset of runtimes, and we get the following:
import statsmodels.api as sm
X = results[["N", "frac"]]
X = sm.add_constant(X)
Y = results["ratio"]
model = sm.OLS(Y, X).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: ratio R-squared: 0.933
Model: OLS Adj. R-squared: 0.931
Method: Least Squares F-statistic: 725.8
Date: Thu, 03 Jan 2019 Prob (F-statistic): 3.34e-62
Time: 13:11:37 Log-Likelihood: 68.819
No. Observations: 108 AIC: -131.6
Df Residuals: 105 BIC: -123.6
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.3396 0.030 11.367 0.000 0.280 0.399
N 4.135e-05 4.63e-06 8.922 0.000 3.22e-05 5.05e-05
frac 1.5038 0.041 37.039 0.000 1.423 1.584
==============================================================================
Omnibus: 3.371 Durbin-Watson: 1.064
Prob(Omnibus): 0.185 Jarque-Bera (JB): 3.284
Skew: -0.166 Prob(JB): 0.194
Kurtosis: 3.787 Cond. No. 1.77e+04
==============================================================================
Plotting this plane-of-best-fit:
We can see that the fit is pretty good, but there is some non-linearities in the second figure we aren't capturing. We should expect non-linearities too: in the batch algorithm, the average batch size is (N * frac) data points, so this interaction should be a factor in the batch algorithm's performance. Let's include that interaction in our regression:
import statsmodels.api as sm
results["N * frac"] = results["N"] * results["frac"]
X = results[["N", "frac", "N * frac"]]
X = sm.add_constant(X)
Y = results["ratio"]
model = sm.OLS(Y, X).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: ratio R-squared: 0.965
Model: OLS Adj. R-squared: 0.964
Method: Least Squares F-statistic: 944.4
Date: Thu, 03 Jan 2019 Prob (F-statistic): 2.89e-75
Time: 13:16:48 Log-Likelihood: 103.62
No. Observations: 108 AIC: -199.2
Df Residuals: 104 BIC: -188.5
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.5465 0.030 17.941 0.000 0.486 0.607
N -1.187e-05 6.44e-06 -1.843 0.068 -2.46e-05 9.05e-07
frac 1.0899 0.052 21.003 0.000 0.987 1.193
N * frac 0.0001 1.1e-05 9.702 0.000 8.47e-05 0.000
==============================================================================
Omnibus: 10.775 Durbin-Watson: 1.541
Prob(Omnibus): 0.005 Jarque-Bera (JB): 21.809
Skew: -0.305 Prob(JB): 1.84e-05
Kurtosis: 5.115 Cond. No. 3.43e+04
==============================================================================
Looks like we capture more of the variance (\(R^2\)), and the optical-fit looks better too! So where are we?
This idea was recently implemented in lifelines, and with some other optimizations to the batch algorithm, we see a 60% speedup on some datasets!
¹ ugh, naming is hard.
]]>I've been playing the video game Heroes of the Storm by Blizzard for about 3 years now. I even attended the game's professional scene's championship last year! One part of the game that has attracted me is the constantly accumulating character roster (I'll provide a short summary of the game later in this post). This means that new characters are being added, and existing characters are being tweaked if they are too powerful or not powerful enough. It's this latter feature that I am most interested in, and the purpose of this blog post: how do we know when a character is over or under powered? There are two methods: a wrong method, and a less-wrong method. This blog article's goal is to re-describe the the old, wrong method, and propose a new ideas for character balance. Keep in mind: the estimated values count for less than the description of the idea.
But before we get into the methods, a quick summary of Heroes of the Storm gameplay (you can skip if you are already familiar).
Heroes of the Storm is a 5-player vs 5-player team game. Before the game begins, each player choose a unique character (called heroes in the game) and is combined with four other players to battle against the opposing team. The objective is to destroy the opponents "core", using either the characters' unique abilities or powerful bonuses scattered around the play area (called a "map"). The game ends when either team's core is destroyed.
At the time of writing, there are over 70 characters, each with unique abilities and advantages. A players decision to choose a characters is based on i) their affinity for a character, ii) which characters are strategically advantageous, given what your opponent is playing (called the "meta"), or iii) which characters are over powered. Focusing on this last reason, a character can be over powered / under powered if their abilities are providing too much damage, healing, etc. Blizzard, the game's creator, will address these different power level of characters with weekly balance patches - however with the game constantly changing, there will never really be perfect equality.
Players can upload their games to third-party services, which aggregate all the games players, characters, map and outcomes into valuable statistics. A very influential statistic is the win rate of character i, defined as:
$$\frac{\text{wins}_{i}}{\text{games participated}_i} $$
which is an estimator for:
$$P(\text{win} \;|\; \text{$i$ present in game})$$
Ideally, the win rate of all characters are close to 0.5, as 0.5 is suppose to represent a character that is equally like to win as to lose (and hence "balanced"). Estimates typically range from 0.35 for the most under-powered, to 0.65 for the most over-powered. In fact, the ordering of this estimate over all characters is commonly used by players to find the most over-powered and under-powered characters, and call for "nerfs" or "buffs" respectively to rebalance the character.
But this is wrong. The win rate is only a correlation. It's not taking into account the entire causal picture. Let's explore that next.
The problem with the above estimate, which I call the naive win rate, is that character selection and the chance of winning have common causes. These common causes are called confounders. For example, professional players may a) select certain characters more often than others and b) win more often. In an extreme case, if the best player in the world selected character T almost exclusively, then character T would have a higher than average win rate, but not necessarily due to the character.
Another example: depending on the other characters on an allied team, a) selection of a character changes, and b) probability of winning changes. If character T is often paired together with S because of some pair synergy, then the win rate of T will be influenced by the win rate of S.
Another example: the higher a team's average MMR (a measure of experience) may a) select a "meta" character more often and b) win more often. More specifically, the more experienced a team, the more likely they will recognise the best character to play in response to their opponents character, and because of such experience, are more likely to win the game.
Another example: when a new hero is released, or an existing hero gets major changes, less experienced players may choose them more. This has an effect on character selection, and on the win rate. Another confounder.
The point is: there exist many causes of both character selection and win probabilities. This means that the naive win rate is not necessarily a good measure of balance, but simply a correlation (not causation) between character selection and winning.
In fact, if you are familiar with causal diagrams, this is perhaps what the causal diagram looks like:
The arrow we are interested in is the arrow between character and win, which I call the causal win rate. The strength of this arrow (difference from 0), and its direction (positive or negative) tell us if a character is over or under powered. This tells us "how much more likely is a win if you choose character T instead of character S?"
We have lots of statistical methods to handle the confounders that muddy our naive win rate. The end result is an estimate of the causal win rate. However, causal inference requires you to state assumptions. Here's the big causal assumption, called exchangeability:
Controlling for the map, opponents' average experience, opponents' characters, allied characters, and allied experience, the win rate of those players that did and didn't choose character T would be the same on average if they all chose character T.
We are also going to simplify the problem, which I'm sure is wrong but it's a first approximation. Let's assume that the character in question is chosen last. That is, the decision to choose character T happens with perfect information of all other character selections and map knowledge. We are also assuming no effect modification (which certainly exists). Effect modification is when the effect (causal win rate) can be modified by other factors. For example, we already mentioned above that possible synergies can exists between pairs of characters - that is a form of effect modification. Similarly, the team composition (ex: how many healers the team has, how many damage dealers, etc.) is a form of effect modification. Like I said, we are going to ignore effect modification for now, and will return to it in another blog post.
For our modelling, we will use outcome regression, which is a causal inference method that carries with it lots of modelling assumptions (Note: there are better methods which have less model assumptions). For example: we are assuming a linear relationship between outcome and variables and we are assuming no interactions (effect modifiers).
Our regression will look like:
$$ \begin{align} \text{logit}(\text{win}_i) & = \alpha_1 \text{AlliedVallaPresent}_i \\ & + \alpha_2 \text{AlliedLiLiPresent}_i + ... \\ & + \beta_1\text{AvgAlliedMMR}_i \\ & + \beta_2\text{AvgOppMMR}_i \\ & + \beta_3\text{Map}_i + ... \\ \end{align} $$
where each a coefficient represents a scalar or vector (as in the case of \(\text{Map}_i\)).
Note that (and this is a consequence of our simple model) that all characters in the regression are symmetric. Thus we just need to run a single regression, and the parameters \(\alpha_i\) are all we are interested in.
What does \(\alpha_i\) represent though? Nothing without a reference point. I chose to use the character Kerrigan as the reference character. A positive alpha represents a better causal win rate relative to Kerrigan, a null alpha represents an equal causal win rate relative to Kerrigan, and a negative alpha represents a worse causal win rate relative to Kerrigan. These relative comparisons may seem like a disadvantage, but I chose Kerrigan specifically as she was the median character with respect to causal win rate (so I ran the program once, using some other character, and Kerrigan was the median in that analysis). Truthfully, it doesn't really matter who is the reference character, the ordering between characters is agnostic to the reference character. However, I choose the median character because they create a natural division of who is over-powered and who is under-powered.
The dataset we are going to use is from user-submitted games to hotslogs.com. The time period is 4/26/2017 to 5/9/2017 which is a period inbetween two balance patches. I filtered down to Hero League and Team League games, and sampled 50k games for analysis.
After we fit the regression, we have an estimated value of alpha for each character. Importantly, we also have error estimates for each alpha that we use to form confidence intervals for each alpha. If this confidence interval contains 0, we can conclude that the character is either balanced, or unbalanced too little to detect. Here are the results (clickthrough for a larger image):
The characters Li Li, Lucio, and Probius are the strongest. On the other hand, Medivh, Tassadar and E.T.C. are the weakest. For some evidence that our analysis is not absolutely wrong: 5 of the top 6 characters above were nerfed in the May 10th, 2017 balance patch.
How does our causal win rate estimates compare to the naive win rate?(Clickthrough for a larger image)
The two metrics are positively correlated, which offers evidence that the naive win rate isn't biased too much by the confounders. The characters in the upper-left and lower-right quadrants are characters that the community is probably thinking about wrong. For example, Lt. Morales has a low naive win rate, and the community might suggest she get a buff. However, she likely is over performing in this dataset. Another very interesting thing to point out is that all the support characters are "shifted" about 0.2 units to the right. Historically this period was when the double-support meta (two or more support characters per team) was gaining popularity, perhaps because Blizzard kept supports artificially over powered as the above graph suggests. Note that this support character advantage is hidden if we only looked at naive win rates. This support advantage is not a artefact of the analysis either, as when I did the same analysis for a more recent dataset, the support shift wasn't present any more.
I managed to get a more recent data (as of the time of writing) from July 2018, spanning a balance patch. So we can observe what happens to heroes' win rates as they were buffed or nerfed on the July 25th balance patch. First, some of the nerfed characters:
And buffs: Azmodan and Kael'Thas saw straight buffs, however Hanzo saw some buffs and some nerfs, and his causal win rate stayed about the same.
There are a lot of critiques one could do about this analysis:
What I hope from this article is that Heroes of the Storm players, and players of similar games, rethink how the naive win rate can be biased. However, one thing that I'd like to think about more is how biased are we? We saw that there was a high correlation between the naive and the causal win rate - does this imply we can use the naive win rate in most circumstances? I'm not sure, and it's something I want to explore. An approach is to use a causal inference model with less assumptions, and including effect modification, to see if we still see a high correlation between the two rates.
Sketch code is available on my Github here.
]]>
The authors start by saying that researchers will naively stratify a risk factor, like smoking, by the birthweight (BW) of infants. Researchers noticed that if the risk factor produced low birthweight (LBW) infant more often, then there would be a specific birthweight where infants below this threshold would have a lower mortality if in the riskier group. For example, for infants born below ~2000g, those infants who were exposed to maternal smoking actually had a lower infant mortality rate than those not exposed to maternal smoking. See figure below:
This seems like madness - not smoking is associated with higher infant mortality in some populations! Some researchers have even suggested that:
"the effect of maternal smoking is modified by birth weight in such a way that smoking is beneficial for LBW babies."
Let's investigate what is really going on.
The authors use a large dataset of over 3 million infants. They define LBW as less than 2500g. Their metric of interest, infant mortality, is defined as a death in the first year per 100,000 livebirths. How do they summarise the effect of maternal smoking?
"We used logistic regression to estimate infant mortality rate ratios and 95 percent confidence intervals..."
In other words, they will be using a specific coefficient in a logistic regression model to produce a summary of the effect of maternal smoking on mortality. Specifically, if we let \(M\) denote the presence of maternal smoking (0 or 1) our model might look like:
$$\text{logit}(p_i) = \beta_0 + \beta_1 M_i $$
and so \(\beta_1\) summarizes the effect of maternal smoking. We'll call the above model the "crude" model. Let \(B\) denote birthweight. We call the following model the "adjusted" model:
$$\text{logit}(p_i) = \beta_0 + \beta_1 M_i + \beta_2 B_i$$
The next question to ask is what else to control for in this regression. The authors say:
"control for potential confounders (maternal age, gravidity, education, marital status, race/ethnicity, and prenatal care). Since adjustment for these potential confounders had virtually no effect, we present only the unadjusted rate ratios below."
What does this mean? If we extended the model to:
$$\text{logit}(p_i) = \beta_0 + \beta_1 M_i + \beta_2 B_i + \beta \cdot C_i$$
where \(C\) are other controls mentioned above, this doesn't change the conclusion (or the coefficients - it's not clear), so the author choose to present results without the additional controls.
After running the regression, the crude model produced that \(\beta_1 = 0.438 \pm 0.029 \). (The paper actually uses the rate ratio, which is the exponential of the coefficient in the model - we won't). However, if we use the adjusted model, then \(\beta_1 = 0.049 \pm 0.032 \), a 10-fold decrease. What happened? Why, after conditioning on BW, did we seem to lose so much effect? It gets worse, in fact.
If we stratify the population into LBW and non-LBW (a "stricter" way to control for a continuous variable), and train the crude model on both these sub-group, we see that smoking actually benefits the infants in the LBW sub-group: the rate ratio of maternal smoking less than 1 (the opposite conclusion is found for the non-LBW sub-group).
I'm going to assume the reader is familiar with causal diagrams. I really liked that the authors went through increasingly complex causal diagrams to explain the paradox. I'll do the same. (Diagrams produced by my latest side project, dog).
This is the simplest model of our causal relationships. Smoking causes LBW, but has no direct effect on mortality, except through LBW. A very useful thing I learned from this paper is that a causal diagram has implications that the data should support, and if the data doesn't support it, the causal diagram is invalidated! Let's see how this works using the results from the regressions. If the above causal diagram is true, then smoking and mortality are certainly correlated. And we see this in the crude regression above (the coefficient is not 0). On the other hand, this diagram also implies that conditioning on BW means that smoking and mortality are independent. However, in the adjusted model above, where we do condition for BW, we still see that the coefficient of smoking is still far from 0. Thus, the data suggests that smoking and mortality are dependent, conditioned on BW. Hence, this diagram is wrong.
We had success with our implication test above. Let's try the same test on this new causal diagram. In the above causal diagram, if we condition on smoking, LBW and mortality should be independent. However, in the adjusted model, LBW has a non-zero coefficient. So we can refute the above diagram.
In this new model, we have smoking causes LBW, smoking causes mortality and there is a causal relationship between LBW and mortality. We see that LBW is a mediator for smoking. The authors are more precise and say that smoking is partially mediated through LBW, since there still is a direct effect of smoking on mortality. The authors state that this DAG is actually the simplest model, because it is "complete" (I suppose they mean this in the graph-theory sense of "complete"). In other words, the previous two graphs had very strict constraints on relationships that allowed us to "test" the graphs. The above graph doesn't have this property, and technically has the least amount structure. However, it probably presents the most accurate model of reality thus far.
This is where the story gets interesting. If you showed this model to domain experts, they would say it is too simple. A domain expert would suggest that there are other, possibly unobserved, factors involved in BW, like birth defect. If this is true, bad things can happen. Let's explore this in diagram 4. (From here, I'll actually diverge a bit from the paper, but only because the paper does it's presentation of diagrams in an odd manner.)
This new diagram introduces an unobserved variable (in blue), birth defects, and removes some causal links - but as we'll see even this simple diagram is enough to give us our "paradox". Birth defects also causes LBW and also has a (strong) positive effect on mortality. So, given this diagram, let's see what happens when we condition on LBW. Among LBW infants, if the mother didn't smoke, then we know the only other cause for LBW was birth defects. By conditioning on LBW, we have introduced a spurious correlation between smoking and birth defects ("if not this, than that") - in fact, this association is negative. And because birth defects has a positive association with mortality (and smoking has none), we have introduced a negative association between smoking and mortality.
The problem occurs because birth defects is unobserved. If we had observed birth defects, we could have controlled for it and everything would have been fine.
The story above doesn't mention observations - so why is observation so important? Even though we don't observe birth defects, the "effect" of it is still present in the outcomes. And because BW is caused by it, the effect of birth defects is passed through to BW in any analysis. This means that even though BW has no effect on the outcome, it will still look like it because it's highly correlated with birth defects.
Let's introduce a link between smoking and mortality now. The authors state
In this case, the rate ratio adjusted for birth weight would be a combination of 1) the true direct effect of smoking on mortality and 2) the bias resulting from the presence of common causes of LBW and mortality.
It's not obvious which effect is larger in magnitude, apriori. Similar to above, if we could control for birth defects, this bias would vanish.
With this diagram, we've added a link between LBW and mortality, adding more plausibility to our diagram. The same bias still remains, however.
The authors then put forward another diagram that suggests that only some types of low birth weight (LBW_B) are actually harmful. The argument is as follows.
Some instances of LBW are not harmful, like being born in a higher altitude. That is, LBW caused by higher altitude infants do not see a higher risk than their counterfactual counterparts born at normal BW in a low altitude area. This LBW is of the type LBW_A. However, a LBW in a low altitude area is of much higher risk, since the cause of the their LBW isn't LBW_A but LBW_B, and is of much higher risk. In this case, if we condition on LBW, we would introduce the same biases observed above.
This is probably my favourite part of the paper. The authors ask "why are we conditioning on BW anyways?" Researchers can be interested in two questions: what is the total effect of smoking, and what is the direct effect of smoking (aside: I learned a lot about the differences between total and direct effects by reading about the Table 2 fallacy.) If the researchers' goal is compute the total effect of smoking on mortality, then adjustment for a child variable (like BW) is not correct. (The author's state that, for many sensible reasons, BW is a cause of smoking, and not vice versa). In fact, the only adjustment needed is for confounders of smoking and mortality. This, according to the back-door criteria, is sufficient to compute the total adjustment.
However, if the researchers goal is to compute the direct effect, things are more difficult. To compute a direct effect, one needs to control for any mediators of smoking. This makes controlling for BW sensible. But then, we need to make sure that if those mediators are colliders (have more than a single cause), we need to control for all their parents too to satisfy the backdoor criteria. This is what happened in Diagram 4 above: because we conditioned on BW, but failed to condition on unobserved birth defects, our estimate of the direct effect of smoking was biased.
In reality, I imagine the original researchers did not have a causal diagram in mind when they first encountered the paradox. I doubt they even had the vocabulary of direct or total effects.
In conclusion, I hope this articled helped understand this highly important and influential paper. Apriori, we know that smoking is bad (hence the confusing "paradox"), but how often might this happen in less conspicuous contexts? Without explicitly drawing or stating a causal diagram, it's easy to make the researchers' original mistake and think some interesting phenomenon is occurring.
]]>
The idea of grouping data science tasks has been bouncing around in my head for the past six or so months, and recently I came across a paper by Miguel Hernán¹ that summarised the ideas nicely. In fact, the metaphorical "pillars" idea came from his paper. This blog article is a combination of his ideas and my own. With that, I present what I (and he) consider the three pillars of data science.
Description is what I feel the majority of data scientists in industry focus on. This could be something as simple as mean, a conversion rate, a time series, etc. or something more advanced like a statistical test. Description also includes data mining algorithms like clustering and association rules.
This is a well explored area, thanks to the 20th century focus on descriptive statistics and methods. This was the result of a few simultaneous conditions:
While this is what I feel is the most common pillar in industry, it's not talked much about in the community. I wonder if is this because it's due to the classical feel of these techniques? Or perhaps it's what I would prefer but do not believe, is that they are so well studied and understand that we need not talk about them. (Why do I not believe this? Take a look at Andrew Gelman's terrific blog almost any time he posts an article. It's always full of mistakes other researchers make, let alone people in industry). No, most of the community's attention is given to the next pillar: prediction.
We spend most of our time thinking, implementing, watching and talking about prediction. Many times a day, a new ML prediction algorithm or article will surface on Hacker News. Contrast where we are today with the previous century: we have lots of compute power, available datasets and accessible technology for prediction. Prediction, in some cases, has been widely profitable for companies. (Not to say description hasn't either. Pharma companies rely on descriptive methods to validate their work. Banks and insurance companies, too, rely on descriptive methods to control their business investments - no other technique could help.) I say some cases, because I do see a lot of excess application of prediction, often where description (or the third pillar) would be sufficient.
The next pillar is the least talked about. Surprisingly too, considering the importance of it.
Sometimes I feel ashamed to admit how ignorant I really am. To be transparent, I didn't learn much causal inference (or care) until about nine months ago. And even then, it was just fumbling around in a dark room of ideas. Slowly, however, I picked up the pieces and learned what to learn. So why should a data scientist know about causal inference?
From my experience, a data scientist gets asked questions that can only be answered with a causal inference lens. Questions like, to use my company Shopify as an example:
These questions, because of the complex nature of the system, cannot be answered with traditional summary statistics. Confounders abound, and populations are biased. We require new tools and ideas that come from causal inference.
Randomised control trials (RCTs) are very common in tech companies. This classical technique is actually generalised in causal inference frameworks. Beyond this, as a data scientist working in industry, you'll have a new perspective on what your observational data can (and cannot!) deliver.
What I find worrying is that, as a community, we are not talking about causal inference enough. This could have some serious harm. If practitioners in industry aren't aware of techniques from causal inference, they will use the wrong tool and could come to the wrong conclusion. For example, both Simpson's Paradox and Lord's Paradox, two very common phenomena, are easy explainable in a causal framework, but without this framework, people can make disastrous and harmful conclusions. For example, in the low birthweight paradox (see paper below), without causal inference there is the potential to conclude that smoking helps low birthweight infants! Furthermore, I rarely see school programs or bootcamps in data science teaching causal inference. It's almost exclusively prediction and technology!
So, rather than describe the slow and meandering way I discovered causal inference, I'll post some links below to give you a head start (probably read these in order):
There's so much more, but at least this gives you a place to learn what to learn.
This article will likely evolve over time, as I discover more and more about the boundaries of these three pillars. But I hope the reader recognises a pillar they are lacking in (for me it was causal inference). Based on this construction, I do expect there to be huge potential for practitioners of causal inference (ex: econometricians and epidemiologists come to mind) in teaching data science.
]]>Take a look at this string of numbers:
333 2 333 2 333 2 33 2 333 2 333 2 333 2 33 2 333 2 333 2 …
At first it looks like someone fell asleep on a keyboard. But there’s an inner logic to the sequence:
This sequence creates itself, number by number, as explained in the image above. Each digit refers to the number of consecutive 3s before a certain 2 appears. Specifically, the first digit refers to the number of consecutive 3s that appear before the first 2, the second digit refers to the number of 3s that appear consecutively before the second 2, and so on toward infinity.
The sequence never ends, but that won’t stop us from asking us questions about it. What is the ratio of 3s to 2s in the entire sequence?
Let's first implement the sequence in Python. We can use a queue to hold the sequence, and when we pop a new element off the queue, we can extend the queue (adding three 3s and a 2, or two 3s and a 2) based on the value of the popped element.
We could brute force the solution by computing the ratio as we extend the sequence ever longer, but this will only ever give us an approximation, and no realisation as to why the solution is so.
I'm going to show off how I played around with the problem, naively, and then started to formulate a solution. The first thing I computed was the ratio of 2s present in the sequence. This isn't the desired answer (ratio of 3s over 2s), but the ratio of 2s, which i'll call \(R\), can be used to find the original solution too: \(\frac{1- R}{R}\). Also, \(R\) can be seen as a probability, something that didn't turn out to be useful but was a possible attack. Using the Python code, R was computed to be approximately 0.2679491. So I had some place to start.
Next I looked at where the 2s were placed in the sequence. Ex: position 4, 8, 12, 15, and so on were 2s. Extending, 2s were placed at positions:
4, 8, 12, 15, 19, 23, 27, 30, 34, 38, 42, 45, 49, 53, 56, 60, 64, 68, ...
Inspecting this, it's clear that 2s will occur every fourth number, except every 4th time, when it appears after three numbers. Call this last exception a "skip". Playing around with this pattern, I constructed a crude function to estimate the position of the \(n\)th 2, \(4n - 4n / (16)\). (I didn't simplify this to make it more clear: the first term is where I expect the position to be given there were no "skips"; the second term is the estimated number of skips after \(4n\) elements of the sequence)
It gave pretty close results for values of \(n\) under 50. I could also estimate \(R\) by looking at \( \frac{n}{4n - 4 n / 16} \). For large \(n\), this gave a number pretty close to my Python approximation of \(R\) (above). But it wasn't close enough, and something was wrong. Simplifying the expression to \(\frac{1}{4 - \frac{1}{4}}\) - I guessed that maybe this looks like a continued fraction! Trying a more accurate approximation \(\frac{1}{4 - \frac{1}{1 - \frac{1}{4}}}\) gave an even better result! Newly motivated, I went back to my "crude approximation function". In the limit, the correct function would look like \( 4n - Rn\). Why? Again, given no "skips" we expect the \(n\)th 2 to be at \(4n\), then subtract the number of "skips", which is exactly the number of 2s, given by \(Rn\). Thus \(R = \frac{1}{ 4 - R} \). (This confirms the continued fraction). Solving the quadratic, \(R = 2 - \sqrt{3} \approx 0.26794919 \).
Now we can solve the original problem, given by \(\frac{1- R}{R} = \frac{\sqrt{3} - 1}{2 - \sqrt{3}} \approx 2.7320508\).
[Image credit to FiveThirtyEight]
]]>Last week, I achieved a goal of hitting 200 problems solved in Project Euler. If you are unfamiliar with Project Euler, it's a website where users can solve tricky problems that require both mathematics and programming to solve. An example question:
(BBBW) | (B,BBW) | (B,B,BW) | (B,B,B,W) |
(B,BB,W) | (BBB,W) | (BB,BW) |
I started in Jan. 2016, and was able to collect my progress of completion over time:
There is also the concept of the the "difficulty" of a problem, provided by the website itself as a measure of how often participants get the answer wrong. Here is the cumulative difficulty:
I'm not finished with Project Euler yet - there are way to many problems I spent time on but didn't solve and they are just itching me. However, I am taking a break for a while so I can focus on some other projects. After all, solving 200 problems was technically my lifetime goal.
If a person asked me "is Project Euler worth the time?" - I wouldn't immediately say yes. You do need lots of time, and it really is a personal thing (there is a leaderboard, but it's not nearly as public as, say, StackOverflow's leaderboards). You also need to think about the opportunity cost - will solving these problems further your career or research? Personally, I feel that Project Euler rounded out my understanding of computer science and mathematics, and not so heavily loaded on statistics and Python applied to statistics.
My problem was not as easy as this:
Essentially, all the information I had was the following: my job threw a NullPointerException somewhere when it tried to process this new dataset:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 533 in stage 1.0 failed 4 times, most recent failure: Lost task 533.3 in stage 1.0. java.lang.NullPointerException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
...
But I still needed to finding this offending row. The idea of a binary search over the dataset appealed to me, but the data was distributed: there was no sense of an “order” to it, that is, I had no concept of a middle value to split the data on. The next idea I thought of was using modulus. Suppose each row had an integer 1,2,3,… associated with it, then I could split the data modulo 2 and test either side, 0 or 1, for a failure. Suppose that the 1s-half contained the failure. Then I should look at splitting the dataset by 1 and 3 modulo 4. And so on. This idea could be applied recursively. In practice, it relies on the fact that, given \(x \equiv k \bmod 2^n\) implies either \(x \equiv k \bmod 2^{n+1}\) or \(x \equiv k + 2^n \bmod 2^{n+1}\). Using this, I can recursively halve my dataset until I have found a single row (or a much smaller set atleast). I’ll explain how to associate each row to an integer in a moment, but first, here’s a sample algorithm in Python:
This is a toy example, but the gist of it is that each round I halve the dataset
and test one of the halves for the desired behaviour (implemented in test
). The running time of this algorithm is \(\mathcal{O}(T\log{n})\), where \(n\) is the size of the dataset and \(T\) is the running time of the job.
This assumes there is an easy way to assign an integer to each row. When implementing this idea in Spark to a distributed dataset, there is no way to index each row consecutively (Spark does have zipWithIndex
in its RDD layer, though this does an internal collect
which isn’t very desirable.) However there is a monotonically_increasing_id
function, which gives each row a monotonically increasing integer, but not necessarily consecutive. Fortunately this algorithm still works (and typically this is where a normal binary search would break). The runtime however is no longer \(\mathcal{O}(T\log{n})\), but \(\mathcal{O}(T\log{M})\), where \(M\) is the maximum integer assigned. So in Spark, my implementation to find the offending row looked like:
So I let this run for an hour or so, and it worked! I was very happy to find the offending row. Though next time, I would change this:
cache
right before your job fails, else you’ll be rerunning the same computations over and overI’m not totally sure this is the fastest way to find the offending line (though, given the tools plus information provided, it might be), but it works and is reliable.
]]>$$ \min_{\theta} -\ell(\theta) + \lambda ||\theta||_p^p $$
where \(\ell\) is the log-likelihood and \(||\cdot||\) is the \(p\) norm. This family of problems is typically solved by calculus because both terms are easy to differentiate ( when \(p\) is an integer greater than 1). Actually this is backwards: we want to solve the MLE using calculus (because that's our hammer) and in order to add a penalizer term, we also need it to be differentiable. Hence why we historically used the 2-norm.
If we don't solve the optimization problem with calculus, well then we can be more flexible with our penalizer term. I took this liberty when I developed the optimizations in lifetimes, my Python library for recency/frequency analysis. The MLE in the model is too complicated to differentiate, so I use numerical methods to find the minimum (Nelder-Mead to be exact). Because of this, I am free to add any penalizer term I wish, deviating as I choose from the traditional 2-norm. I made the choice of \(\log\), specifically:
$$ \min_{\alpha, \beta} -\ell(\alpha, \beta) + \lambda \left( \log(\alpha) + \log(\beta) \right) $$
This was my logic when I first developed lifetimes. Things were going well, until I started noticing some datasets that would produce unstable convergence only when the penalizer coefficient \(\lambda\) was positive: it was driving some variables to nearly 0. How could this be? Shouldn't any positive penalizer coefficient help convergence? For this, we'll take two perspectives of this problem.
I probably don't need to say it, but the log of a value less than 1 is negative. More extreme, the log of a very small value is very very negative (because the rate of change of log near zero gets larger as we approach the asymptote). Thus, during optimization, when a parameter starts to get small, the overall penalizer term starts to gain momentum. In fact, the optimizer starts to shrink a particular parameter to near 0 because that really really helps the overall optimization.
This is obviously not what I wanted. Sure, I wanted to keep values from being too large, but I certainly did not want to reduce parameters to near zero! So it made sense that when I had \(\lambda\) equal to 0 I did not observe this behaviour.
On the other extreme, the \(\log\) penalizer is kinda a terrible penalizer against large values too. An order of magnitude increase in a parameter barely makes a difference in the log of it! It's an extremely sub-linear function, so it doesn't really penalize large parameter sizes well.
As noted in a chapter in Bayesian Methods for Hackers, there is a really beautiful and useful relationship between MLE penalizer terms and Bayesian priors. Simply, it comes down to that the prior is equivalent to the negative exponential of the penalizer term. Thus, the 2-norm penalizer term is a Normal prior on the unknowns; the 1-norm penalizer term is a Laplace prior, and so on. What is then our \(\log\) prior? Well, it's a \(\exp(-\log(\theta)) = \frac{1}{\theta}\) prior on \(\theta\). This is strange, no? It's an improper prior, and it's in fact a Jeffery's prior, so I'm basically saying "I have no idea what this scale parameter should be" - not what I want to be doing.
As much as I like the scale-free property of the \(\log\), it's time to say goodbye to it in favor of another. I think for my purposes, I'll try the Laplace prior/1-norm penalizer as it's a nice balance between disallowing extremely large values and not penalizing the largest parameter too strongly.
Update: I actually went with the square penalizer in the end. Why? The 1-norm was still sending too many values to zero, when I really felt strongly that no value should be zero.
]]>Let \(S_n = \sum_{i=1}^n U_i \) be the sum of \(n\) Uniform random variables. Let \(N\) be the index of the first time the sum exceeds 1 (so \(S_{N-1} < 1\) and \(S_{N} \ge 1\)). The distribution of \(U_N\) is \(e - e^{1-u}\).This result is related to the really astounding result: \(N\) (the same \(N\) defined above) has expected value equal to \(e\). Yes, that's right, the average number of uniform random variables needed to exceed 1 is exactely \(e\). I just love this surprising result, but have still not found a good, intuitive reason as to why this is true. Of course it is true mathematically, and the proof is not too hard, but I just can't see why. Anyways, below is the proof of the first result.
I've seen some really interesting probability & numerical solutions using a strategy called Poissonization, but Googling for it revealed very few resources (just some references in some textbooks that I don't have quick access to). Below are my notes and repository for Poissonization. After we introduce the theory, we'll do some examples.
The technique relies on the following theorem:
Theorem: Let \(N \sim \text{Poi}(\lambda)\) and suppose \(N=n, (X_1, X_2, ... X_k) \sim \text{Multi}(n, p_1, p_2, ..., p_k)\). Then, marginally, \(X_1, X_2, ..., X_k\) are are independent Poisson, with \(X_i \sim \text{Poi}(p_i \lambda)\). [1]
The proof of this theorem is as follows. First, by the law of total probability:
$$
\begin{align*}
P(X_1=x_1, ..., X_k = x_k\;|\; n, \mathbf{p})
& = \sum_{n=0}^{\infty} P(X_1=x_1, ..., X_k = x_k \;|\; N=n) \frac{e^{-\lambda}\lambda^n}{n!} \\
& = \sum_{n=0}^{\infty} \frac{(x_1 + ... + x_k)!}{x_1!x_2! \cdots x_k!}p_1^{x_1}\cdots p_k^{x_k}\frac{e^{-\lambda}\lambda^n}{n!}I_{x_1 + ... + x_k = n} \\
& = e^{-\lambda}\lambda^{x_1} \cdots \lambda^{x_k}p_1^{x_1} \cdots p_k^{x_k}\frac{1}{x_1!x_2! \cdots x_k!}\\
& = e^{-\lambda}(\lambda p_1)^{x_1} \cdots (\lambda p_k)^{x_k}\frac{1}{x_1!x_2! \cdots x_k!}\\
& = \prod_{i=1}^k \frac{e^{-\lambda p_i}(\lambda p_i)^{x_i}}{x_i!}\;\;\;\;\blacksquare
\end{align*}
$$
Let's see how we can use this to solve probability problems. Let \((Y_1, Y_2, ...Y_k)\) be \(\text{Multi}(n, p_1, p_2, ..., p_k)\), and we are interested in when the realized values fall into some set \(A\). Let \(X_i\) be defined as above, that is, \(X_i \sim \text{Poi}(p_i \lambda)\). Then, from above, we can write:
$$
\begin{align*}
P((X_1, ... X_k) \in A) & = \sum_{i=0}^\infty \frac{e^{-\lambda}\lambda^n}{n!} P((Y_1, ... Y_k) \in A)\\
& = e^{-\lambda} \sum_{i=0}^\infty \frac{\lambda^n}{n!} P((Y_1, ... Y_k) \in A)\\
& e^{\lambda}P((X_1, ... X_k) \in A) = \sum_{i=0}^\infty \frac{\lambda^n}{n!} P((Y_1, ... Y_k) \in A) \;\;\;(★)
\end{align*}
$$
Both sides of the last line are infinite series of \(\lambda^n\), and thus their coefficients must be equal. Thus, as \(N=n\), \(P((Y_1, ... Y_k) \in A)\) is equal to \(n! c(n)\) where \(c(n)\) is the coefficient of \(\lambda^n\) in the expansion of \(e^{\lambda}P((X_1, ... X_k) \in A)\).
Let's try an example:
n defects are randomly distributed amongst k integrated-circuit chips produced by a factory (any number of defects may be found on a chip and each defect is independent of the other defects). If n = 4, k=7, find the probability that there is a chip with at least 3 defects.
This is a classic case where it's easier to compute the complementary probability: what is the probability that all chips have less than 3 defects (and then we subtract this from 1 to get the original probability).
Since each defect is randomly distributed, \(p_i = p = \frac{1}{7}\). Below, we invoke Poissonization by letting \(n\) be randomly distributed as \( \text{Poi}(\lambda)\) (later we will condition on its actual value of 4). That means that we can assume the number of defects on a chip is a Poisson random variable with rate \(\lambda p = \frac{\lambda}{7}\).
$$
\begin{align*}
P(\text{All chips have $\le$ 2 defects})
& = \Big(P(\text{Chip has 0 defect}) + P(\text{Chip has 1 defect}) + P(\text{Chip has 2 defect})\Big)^7 \\
& = \Big(e^{-\lambda/7} + \frac{e^{-\lambda/7}(\lambda/7)^1}{1!} + \frac{e^{-\lambda/7}(\lambda/7)^2}{2!}\Big)^7\\
& = e^{-\lambda}\Big(1 + \frac{\lambda}{7} + \frac{\lambda^2}{2!7^2}\Big)^7\\
\end{align*}
$$
When we multiply the above by \(e^{\lambda}\), and imagine the right-hand side is expanded, we get a form that is familiar to equation (★). This means that we are interested in the coefficient of \(\lambda^4\). We could expand this by hand, but instead let's expand this computationally. For this, it's useful to know a bit about multinomial expansions. But basically you are looking for solutions to
$$
k_0 + k_1 + k_2 = 7\\
0 k_0 + 1 k_2 + 2 k_2 = 4
$$
The first equation comes from the constraint for the powers of a multinomial expansion to sum to the outside exponent. The second equation comes from our need to find powers of \(\lambda\) that sum up to 4 (the 0,1,2 come from the existing powers of \(\lambda\)). Two equations, three unknowns, solving for the values:
$$
\begin{align*}
k_2 &= k_0 - 3 \\
k_1 &= 10 - 2k_0\\
3 &\le k_0 \le 5
\end{align*}
$$
Computationally, we can write this in Python:
We get the answer 0.92711, so the complement to this is 0.07289 and is the probability of the original question.
This comes from FiveThiryEight's Riddler series: Santa Needs Some Help With Math. I'll quote the problem here:
In Santa’s workshop, elves make toys during a shift each day. On the overhead radio, Christmas music plays, with a program randomly selecting songs from a large playlist.
During any given shift, the elves hear 100 songs. A cranky elf named Cranky has taken to throwing snowballs at everyone if he hears the same song twice. This has happened during about half of the shifts. One day, a mathematically inclined elf named Mathy tires of Cranky’s sodden outbursts. So Mathy decides to use what he knows to figure out how large Santa’s playlist actually is.
Help Mathy out: How large is Santa’s playlist?
Okay, so let's translate this to our set up. We have \(k\) songs to choose from, and we sample 100 of them. Turns out that P(any song is played more than once) = 0.5. So in this case, we know the probability, but we don't know the \(k\). So, similarly to how we started above:
$$
\begin{align*}
P(\text{songs have 1 or less plays})
& = \Big(P(\text{songs has 0 plays}) + P(\text{songs has 1 plays})\Big)^k \\
& = \Big(e^{-\lambda/k} + \frac{e^{-\lambda/k}(\lambda/k)^1}{1!}\Big)^k\\
& = e^{-\lambda}\Big(1 + \frac{\lambda}{k}\Big)^k\\
\end{align*}
$$
Yay we have a binomial! They are much easier to work with. Now we need to answer:
What is the coefficient of \(λ^{100}\)?
Call this coefficient \(c_{100}(k)\). Since we have a binomial, expanding the above is easy:
$$
c_{100}(k) = \frac{C(k, 100)}{k^{100}}
$$
Using the other piece of information above, that the probability of an "event" is 0.5, we can write an equation down:
$$
\begin{align*}
0.5
& = 100! c_{100}(k) \\
& = 100! \frac{C(k, 100)}{k^{100}} \\
& = \frac{k!}{(k-100)! k^{100}} \\
\end{align*}
$$
$$
2\frac{k!}{(k-100)!} - k^{100} = 0
$$
We can use Python (!!) to numerically solve this equation for \(k\).
Because the sign flips at 7175, we conclude that either 7174 or 7175 is the closest answer.
Looking for more examples?
[1] DasGupta, A. Probability for Statistics and Machine Learning. http://link.springer.com/book/10.1007%2F978-1-4419-9634-3
]]>
Python developers are commonly using Python as a tool to explore datasets - but what if we reverse that analysis lens back on to the developer? In this talk, Cam will use Python as a data analysis tool to explore Python developers and code. With millions of data points, mostly scraped from Github and Stackoverflow, we'll reexamine who the Python developer is, and how Python developers write code.
]]>
You can purchase it on Amazon today!
]]>On the other hand, a non-constructive proof does not detail how to build the object, just states that it must exist. Often this is done by finding a proof by contradiction, but some interesting proofs use probability theory. For example, if I'm interested in showing that there exists an object with some property less than 0, then an argument of the form "the expected value of this property is 0, and if I know there is an object with property more than 0, there must be an object with property less than 0, too". I didn't find that specific object, but I determined it must exist.
There is an interesting trend occurring in data analysis that is analogous to a non-constructive proof. The author won't find the object desired, but will state, statistically, that it does exist. Sound odd?
Suppose it was early 2015, and I was interested in analyzing the Billboard's 2015 Song of the Summer. I might ask, "has the Song of the Summer been released yet?" By analyzing when the winning song was released in previous years, I can get a distribution of when this year's Song of the Summer might be released. If I'm past the majority of previous dates, then I can statistically conclude that 2015's Song of the Summer exists. I haven't shown what the song of the summer is, just that it exists.
In fact, this is what FiveThirtyEight did this year: they suggested that on May 5th, you have probably already heard the 2015 Song of the Summer. They didn't guess which song, just that it existed.
I find this to be a really powerful technique. By showing something exists, it can narrow down your search significantly. Suppose you and a friend were betting on what movie will win the Best Picture Oscar. You could probably do quite well by betting on all movies made after October, and nothing before. This is because the Oscar winner often is released (exists) after October.
Another extension, from a decision maker's perspective, is to understand when something should exist, and place yourself there. For example, good luck to the artist who hopes to achieve Song of the Summer with a late May release! Similar ideas apply to presidential elections (when should a candidate enter the nominee race to maximize chances of being president?) and theatrical releases hoping to win Oscars.
Any other applications? This feels like one of those tricks that is not generally applicable, but very powerful in a some situations. Leave other ideas in the comments!
]]>
This is a very simple Bayesian problem, on paper! See Allen's book for the solution (link above). What if we wanted to use PyMC to solve it? Below is the solution (originally from CrossValidated).
It looks like there is a bug in 2.3.4 (the most recent version) that is causing the wrong inference. This took me a while to discover, but it was solved when I downgraded to PyMC 2.3.2.
import pymc as pm
p = [
#brown, yellow, red, green, orange, tan, blue
[.3, .2, .2, .1, .1, .1, .0 ], # 1994 bag
[.13, .14, .13, .2, .16, .0, .24] # 1996 bag
]
bag = pm.Categorical('which_bag', [0.5, 0.5]) # prior on bag selected
@pm.deterministic
def first_bag_selection(p=p, bag=bag):
return p[bag]
@pm.deterministic
def second_bag_selection(p=p, bag=bag):
return p[1-bag]
# observe yellow in bag 1
obs_1 = pm.Categorical('bag_1', first_bag_selection, value=1, observed=True)
#observe green in bag 2
obs_2 = pm.Categorical('bag_2', second_bag_selection, value=3, observed=True)
mcmc = pm.MCMC([bag, p, first_bag_selection, second_bag_selection, obs_1, obs_2])
mcmc.sample(15000,3000)
bag_trace = mcmc.trace('which_bag')[:]
# what is the probability the bag chosen is from 1996?
print (bag_trace == 1).sum()*1./bag_trace.shape[0]
# should be about .26
]]>
We construct a dynamical population whose individuals are assigned elements from an algebraic group \(G\) and subject them to sexual reproduction. We investigate the relationship between the dynamical system and the underlying group and present three dynamical properties equivalent to the standard group properties.
Consider a sufficiently large population of \(N\) individuals. Sexual reproduction between individuals is a binary action, that is, sex involves two individuals. An interesting assumption is to consider if the population consists of individuals from an algebraic group. Recall the definition of a group : A set, \(G\), and binary operator \(\ast\) is called a group iff
there exists an identity element, \(e\), s.t. \(\forall x \in G, x\ast e=e \ast x = x\).
\(\forall x \in G\) there exists an inverse, \(x^{-1}\), s.t. \(x\ast x^{-1}=x^{-1}\ast x = e\).
\(\forall x,y \in G, x\ast y \in G\) and \(y\ast x \in G\).
What would the behaviour of a sexual system look like if all individuals were members of a group? Can we find such systems in nature, or are our constructions purely artificial?
We call the order of group \(G\) the number of elements in the group, denoted \(|G|\). A group, \(G\), is called abelian if the operation is commutative, i.e. \(\forall x,y \in G, x\ast y = y\ast x\). A subset \(H \subset G\) is called a subgroup iff \(e\in H, \forall x \in H \Rightarrow x^{-1} \in H\), and \(H\) is closed under the binary operation. A Cayley table of a group is \(n \times n\) matrix where the \((ij)th\) entry is the product of the \(ith\) and \(jth\) element. For example, consider \(\mathbb{Z}_2\times \mathbb{Z}_2=\{(0,0),(1,0),(0,1),(1,1)\}\) under addition mod 2. The Cayley table is \[\left[ \begin{array}{cccc} (0,0) & (1,0) & (0,1) & (1,1)\\ (1,0) & (0,0) & (1,1) & (0,1)\\ (0,1) & (1,1) & (0,0) & (1,0)\\ (1,1) & (0,1) & (1,0) & (0,0) \end{array} \right]\]
This matrix completely defines the group.
We describe a stochastic process that will characterize our evolving population. Consider again a population of \(N\) individuals where \(N_i(t)\) denote the number of \(i\) individuals in the population. We consider the interaction between two individuals to represent sexual reproduction. The birth-death process begins by two individuals being randomly chosen proportional to their frequency in the population. The new offspring ( product of the two individuals) is substituted for another randomly chosen individual, again proportional to frequency (thus the population size is constant). The process is repeated for each time step. For example, consider the group \(\mathbb{Z}_2=\{0,1\}\). Recall that in this group \(1\ast1=0\), so the probability that the population of 1s increases, i.e. \(N_1(t+1) = N_1(t) + 1\), is equal to the probabilty a 1 and 0 are chosen as parents, and a 0 is chosen to die:
\[\begin{aligned} P^{+} &=P(N_1(t+1)=N_1(t)+1| N_1(t))\\ &=\left( \frac{N_1(t)}{N} \frac{N_0(t)}{N-1}+\frac{N_0(t)}{N} \frac{N_1(t)}{N-1}\right) \frac{N_0(t)}{N}\\ &=\left( \frac{2N_1(t)}{N} \frac{N_0(t)}{N-1}\right) \frac{N_0(t)}{N}\end{aligned}\]
since for the population \(N_1(t)\) to increase by one in a time step, the new individual resulting from the interaction must be a 1 and the death must be a 0. Similarly,
\[\begin{aligned} P^{-} &=P(N_1(t+1)=N_1(t)-1| N_1(t))\\ &=\left( \frac{N_1(t)}{N} \frac{N_1(t)-1}{N-1}+\frac{N_0(t)}{N} \frac{N_0(t)-1}{N-1}\right) \frac{N_1(t)}{N}\\\end{aligned}\]
and \(P^{0}=P(N_1(t+1)=N_1(t)| N_1(t))=1-P^{+} -P^{-}\). Thus we have
\[\begin{aligned} E[N_1(t+1)|N_1(t)] &=P^+\cdot (N_1(t)+1)+P^-\cdot (N_1(t)-1)+P^0\cdot (N_1(t))\\ &=P^+-P^-+N_1(t)\Rightarrow\end{aligned}\]
\[E[N_1(t+1)-N_1(t)|N_1(t)]=P^+-P^-\] Working with probabilities can be unwieldy, so we take the limit as \(\Delta t\rightarrow 0\) and \(N \rightarrow \infty\). If we denote\(N_1(t)/N \approx x\) (then \(N_0(t)/N \approx 1-x\)), we have
\[\begin{aligned} &P^+=2x(1-x)^2\\ &P^-=x(x^2+(1-x)^2) \Rightarrow \\ &\frac{dx}{dt} \approx P^+-P^-= x(1-2x)\end{aligned}\]
This system has asymptotically stable points at \(x=0\) and \(x=1/2\), i.e. when the population of 1’s is extinct or equal to the population of 0’s.
Initially, to model such an evolving system, we consider only \(\Delta t\rightarrow 0\) and \(N \rightarrow \infty\). Thus we enter the well-studied realm of continuous dynamical systems. Given a group of order \(n\), we need some way to introduce the group structure into the dynamical system. The Reverse Cayley matrix, \(A\), is defined as an \(n\times n\) matrix where the \(ith\) column of \(A\) contains the product of frequencies such that the product of the corresponding elements is the \(ith\) element of the group. For example, consider \(\mathbb{Z}_2\times \mathbb{Z}_2=\{(0,0),(1,0),(0,1),(1,1)\}\), and let \(e,x,y,z\) denote the frequencies of the enumerated set. Then define \(A\) as \[\left[ \begin{array}{cccc} e^2 & ex & ey & ez\\ x^2 & xe & xz & xy\\ y^2 & yz & ye & yx\\ z^2 & zy & zx & ze \end{array} \right]\] The matrix can easily be constructed from the group Cayley table. In each row, for every element \(g \in G\), there exists a pair of elements such that the product is \(g\). Compare the Reverse Cayley matrix with the the Cayley matrix for \(\mathbb{Z}\times \mathbb{Z}_2\) given in section 2.
The dynamics of the system are given by the following system of equations, analogous to the limiting case in the example with \(\mathbb{Z}_2\). It is straightforward to show the dynamics are the same. In what follows, \(\mathbf{1}\)=(1,1…,1) and \(\mathbf{e_i}\) are the standard euclidean basis vectors and \(x_i\) is the \(ith\) element in the group.
\[\begin{aligned} \dot{x}_j &=P^{+}_{j}-P^{-}_{j}\\ &=\left(\sum^{n}_{i=1}\ A_{ij}\right)(1-x_j) - \left(1-\sum^{n}_{i=1}\ A_{ij}\right)x_j\\ &= \sum^{n}_{i=1}\ A_{ij} - x_j \\ &= \mathbf{1}\cdot A \mathbf{e_j} - x_j \text{ which can also be rewritten, if convenient, as}\\ &=\mathbf{1}\cdot A (-x_1,\ldots, -x_{i-1},1-x_i,-x_{i+1}, \ldots,-x_n)^T\end{aligned}\]
This greatly simplifies our dynamics. In fact, it is now trivial to show that the barycenter of the \(n\)-simplex, \(S_n\), is an interior stable point: suppose the frequency vector \(\vec{\mathbf{x}}=(1/n, \ldots, 1/n)\), then \(A=A(\vec{\mathbf{x}})=\frac{1}{n^2}J_n\), where \(J_n\) is the matrix full of ones, then
\[\begin{aligned} \dot{x}_i &=\frac{1}{n^2}\mathbf{1}\cdot J_n (1/n,\ldots,1-1/n,\ldots, 1/n)^T\\ &=\frac{1}{n^2} n\mathbf{1}\cdot (1/n,\ldots,1-1/n,\ldots, 1/n)^T\\ &=\frac{1}{n}(1-\frac{1}{n}-\ldots -\frac{1}{n})=0\end{aligned}\]
For the rest of the paper, let \(x_1\) represent the frequency of the identity element of the underlying group \(G\). Recall the heuristic definition of an invariant manifold is a manifold, \(N\), such that if the system starts somewhere in \(N\), it stays in \(N\) for all time . A useful concept in this study is the support of an invariant manifold, \(N\), defined as the following: \(i\in Supp(N)\) iff \(x_i>0\) in \(Int(N)\). Define the order of the manifold, denoted \(|N|\), as \(|Supp(N)|\).
\(\{ i_1, i_2, ...,i_k\}\) is an subgroup of G iff the sub-manifold of the dynamical system with support \(\{ i_1, i_2, ...,i_k\}\) is invariant.
----------------------
Suppose a population is described by an underlying group \(G\), and \(H\) is a subgroup of \(G\). Let \(\vec{\mathbf{x}}\) denote the frequencies of each member of the population. Suppose the population starts at a point where only members of \(H\) are present. For example, for all \(i\in G \setminus H, x_i=0\). Let element \(i\) not be in \(H\). We can derive \(i\)'s frequency in the population by \(\dot{x}_i=\mathbf{1}\cdot A\mathbf{e_i}\). We then let the system evolve. I claim that all elements in the \(ith\) column of \(A(\vec{\mathbf{x}})\) are zero. If there was an element in the column that was non-zero, say \(x_k x_j\), then \(x_k x_j=x_i\) by definition of the reverse Cayley matrix \(A\). But as \(x_k, x_j\) are non-zero, elements \(k\) and \(j\) belong to \(H\). As \(H\) is a group, this implies \(k \ast j = i \in H\), a contradiction. Thus, as the \(ith\) column is all zero, \(\dot{x}_i=0\).
On the other hand, suppose that \(H\) is not a subgroup. Then there exists \(u,v \in H\) s.t. \(u\ast v =k \not \in H\) thus \(x_k=0\) while \(x_u,x_v>0\). Then \(x_u x_v\) is in the \(kth\) column of \(A\). Thus \(\dot{x}_k=\mathbf{1}\cdot A\mathbf{e_k}>0\) and the system is not invariant.
----------------------
The above proposition implies that not all sub-simplicies of this system are invariant. If we consider the group \(\mathbb{Z}_2\times \mathbb{Z}_2\), then the only invariant manifolds of the system, described in terms of sub-simplicies of \(S_4\), are: the interior of \(S_4\), the interior of the three edges projecting from \((1,0,0,0)\) and the point \((1,0,0,0)\). It is clear that, say, \(\mathbf{e}_2=(0,1,0,0)\) is not an invariant manifold of the system as \((1,0)\ast (1,0) = (0,0) \not= (1,0)\), thus the system will evolve away from that point. It is clear that all invariant sub-manifolds of the dynamical system are actually sub-simplicies of \(S_n\). Both terms will be used interchangeably for the remainder of the paper.
\(1\in Supp(N)\) for all invariant manifolds, \(N\), of the dynamical system.
----------------------
If all the 2-dimension simplicies extending from \(\mathbf{e}_1\) are invariant, then the underlying group is abelian.
----------------------
The assumption implies that \(\{e,i\}\) is a subgroup for all \(i \in G\). Thus, by the necessity of (sub)groups, \(i\ast i =e\) for all \(i \in G\). It is a well-known (and interesting) theorem in group theory that this condition implies the group is abelian.
There is an alternative way to describe the dynamical system that will be used in the next proposition. Define \(B_k\) as the binary Cayley matrix s.t. if elements \(i\ast j=k\), then \((B_k)_{ij}=1\) and 0 otherwise, for all \(1 \le k \le n\). Clearly \(\sum_k\ B_k = J_n\) where \(J_n\) is the \(n\times n\) matrix of all ones. A property of \(B_k\) is that the columns of the matrix are the standard euclidean basis vectors. It is simple to show the dynamical system can be rewritten as:
\[\dot{x}_i =\vec{\mathbf{x}}\cdot B_i \vec{\mathbf{x}} -x_i\]
The barycenter of an invariant sub-manifold is (asymptotically) stable.
----------------------
What follows is a partial proof. Taking the time derivative of the Lyapunov function \(V(\vec{\mathbf{x}}) = \sum\ x_{i}^2\) we get
\[\begin{aligned} V'/2 & = \sum_i\ \vec{\mathbf{x}} \cdot x_i B_i \vec{\mathbf{x}} - \sum_i x_{i}^2 \\ & = \vec{\mathbf{x}} \cdot \sum_i \ x_i B_i \vec{\mathbf{x}} - \vec{\mathbf{x}} \cdot I \vec{\mathbf{x}} \\ & = \vec{\mathbf{x}} \cdot ( D-I) \vec{\mathbf{x}}\end{aligned}\]
where \(D = \sum \ x_i B_i\). Next is an exercise to show that \(D-I\) is a (semi) negative definite matrix which is equivalent to showing its eigenvalues are (negative) non-positive. I have not been able to generalize the idea to every system yet.
----------------------
Using this new approach in notation, we can translate the properties of groups, outlined in the introduction, to the language of dynamical systems if the population follows our birth-death process.
Unique Identity Element: the only invariant sub-manifold with order 1 is \(\mathbf{e}_1=(1,0,..,0)\).
Unique Inverse Element: \(B_1\) is symmetric.
Closure: the interior of the simplex \(S_n\) is invariant.
----------------------
This is clear as the only subgroup of order 1 of any group is \(\{e\}\), thus the only invariant manifold with order 1 is a population of just identity elements. On the other hand, if the only invariant manifold with order 1 is \((1,0,..0)\), then \(\{x_1\}\) is the only subgroup of order 1 of the underlying group. Thus, as \(e\) belongs to every (sub)group, \(x_1=e\).
For any \(x \in G\) with enumeration \(i_x\), there exists a 1 in the \(i_x\) column of \(B_1\). This means, there exists a \(y \in G\) s.t. \(x*y=e\). As \(B_1\) is symmetric, this also implies in the \(i_y\)th column of \(B_1\), there is a 1 in row \(i_x\), implying \(y*x=e\) . A remark: this is a poor characteristic of the unique inverse element property, as it relies too much on \(B_i\). I would prefer a more dynamical definition.
Follows from the proof of proposition 3.1.
----------------------
Aside from the poor definition of a unique inverse element, it would be a interesting question to ask, if a dynamical system satisfies the three characteristics above then does there exists an underlying group? Let’s tentatively say, if a given dynamical system satisfies the above three properties, we call such a system a Group-Equivalent system. Then, if there exists an underlying group, the system is Group-Equivalent i.e. satisfies the above properties; if a system is Group-Equivalent, does there exist an underlying group “driving” the system? A system driven by group \(G\) is not unique as adding multiplicative constant does not change the dynamics.
Suppose a dynamical system has an underlying group structure. Then by Lagrange’s Theorem (the order of every subgroup of \(G\) divides the order of \(G\)) we know that the order of every invariant sub-manifold divides the order of \(S_n\) (recall \(|Int(S_n)|=n\)). Suppose now that a system is Group-Equivalent, and it is true that this implies an underlying group structure. Can we prove Lagrange’s Theorem using a dynamical approach?
The last few remarks foreshadow the possibilities of studying a dynamical system with an underlying group, or a Group-Equivalent system. Two ideas can next occur: First, we use dynamical systems theory to prove theorems in group theory; second, we use group theory to solve problems involving dynamical systems. I believe the former to be of little value outside of itself; the literature and research in group theory is already so large that this novel approach to group theory will lead to nothing novel in the field as a whole. The latter idea is more fruitful I believe. If it is possible to transform a dynamical system into a system satisfying the properties outlines above, then we can apply very strong ideas from group theory to the system.
Further extensions include including fitness levels to the populations and finding real-world applications of group-driven systems, both of which have been addressed, to some degree, in a companion paper to this, see Extensions of Population Dynamics of Algebraic Groups.
J. Hofbauer, K. Sigmund, The Theory of Evolution and Dynamical Systems. Cambridge UP. 1988.
M. Hirsch, S. Smale, R.L. Devaney Differential Equations, Dynamical Systems & An Introduction to Chaos. Elsevier Academic Press. 2004.
J.A. Gallian, Contemporary Abstract Algebra. Houghton Mifflin. 2006.
C. Davidson-Pilon Extensions of Population Dynamics of Algebraic Groups Submitted to Journal of Theoretical Biology 2011.
]]>Next, suppose you are interested in the sample average of a dataset that exists on many computers. No problem you think, as you create a function that returns the sum of the elements and the count of the elements, and send this function to each computer, and divide the sum of all the sums by the sum of all the counts.
Next, suppose you are interested in the sample median of that same distributed dataset. No problem you think, as you create a function that sorts the array and takes the middle element, and send this function to each computer, and - wait. That won't work.
The problem is is that the median of median is not equal to the median. This is why many SQL engines do not support a MEDIAN function (which left many analysts scratching their heads). What's needed is an algorithm that can approximate the median, while still being space efficient.
First published in 2013 by the uber-practical and uber-intelligent Ted Dunning, the t-Digest is a probabilistic data structure for estimating the median (and more generally any percentile) from either distributed data or streaming data. Internally, the data structure is a sparse representation of the cumulative distribution function. After ingesting data, the data structure has learned the "interesting" points of the CDF, called centroids. What do I mean by that? Consider the following very lopsided distribution:
(It's really a mixture of Normals). The data has empirical CDF:
The t-Digest will literally eat this dataset, and try to represent the "interesting" points of its cumulative distribution. We can plot the internal sparse representation of the CDF:
The vertical lines denote where the digest thinks "interesting and relevant" centroids are. This is really neat. We can see the digest has only kept the parts of the CDF where the CDF is changing fastest. This makes sense: these are where quantities like the percentile are changing quickest. Flat portions of the CDF, like near x=3 and x=8, only need to be summarized by a few points.
Running a small test locally, I streamed 8mb of pareto-distributed data into a t-Digest. The resulting size was 5kb, and I could estimate any percentile or quantile desired. Accuracy was on the order of 0.002%.
So we can stream in data to the t-Digest, but how can we use this data structure for map-reduce? In the map stage, we can process data into the t-Digest sequentially, so we have N t-Digests in N computers. At the reduce stage, we can define an addition operation on two t-Digests as follows. Create a new t-Digest and treat the internal centroids of the two left-hand side digests as incoming data. The resulting t-Digest is a only slightly larger, but more accurate, t-Digest. Amazing: the t-Digest literally eats itself.
Interested in the algorithm, but without any code to read (I can't yet read Ted's implementation in Java), I wrote a semi-efficient t-Digest in Python (with helpers from Cython). You can find it in the CamDavidsonPilon/tdigest repo. There are some further optimizations to do, but it's probably efficient enough for most offline work!
I highly recommend Dunning's original paper on the t-Digest below - it's an easy read. Him and some other authors are constantly improving the data structure as well, see issues here.
Lifetimes is my latest Python project. Below is a summary, but you can also check out the source code on Github.
As emphasized by P. Fader and B. Hardie, understanding and acting on customer lifetime value (CLV) is the most important part of your business's sales efforts. And (apparently) everyone is doing it wrong. Lifetimes is a Python library to calculate CLV for you.
More generally, Lifetimes can be used to understand and predict future usage based on a few lax assumption:
Lifetimes can be used to both estimate if these entities are still alive, and predict how much more they will interact based on their existing history. If this is too abstract, consider these situations:
Really, "customers" is a very general term here, (similar to "birth" and "death" in survival analysis). Whenever we have individuals repeating occurrences, we can use Lifetimes to help understand behaviour.
pip install lifetimes
Requirements are only Numpy, Scipy, Pandas.
The examples below are using the cdnow_customers.csv
located in the datasets/
directory.
from lifetimes.datasets import load_cdnow
data = load_cdnow(index_col=[0])
data.head()
"""
x t_x T
ID
1 2 30.43 38.86
2 1 1.71 38.86
3 0 0.00 38.86
4 0 0.00 38.86
5 0 0.00 38.86
"""
x
represents the number of repeat purchases the customer has made (also called frequency
). T
represents the age of the customer. t_x
represents the age of the customer when they made their most recent purchases (also called recency
).
We'll use the BG/NBD model first. Interested in the model? See this nice paper here.
from lifetimes import BetaGeoFitter
# similar API to scikit-learn and lifelines.
bgf = BetaGeoFitter()
bgf.fit(data['x'], data['t_x'], data['T'])
print bgf
"""
<lifetimes.BetaGeoFitter: fitted with 2357 customers, a: 0.79, alpha: 4.41, r: 0.24, b: 2.43>
"""
After fitting, we have lots of nice methods and properties attached to the fitter object.
Consider: a customer bought from you every day for three weeks straight, and we haven't heard from them in months. What are the chances they are still "alive"? Pretty small. On the other hand, a customer who historically buys from you once a quarter, and bought last quarter, is likely still alive. We can visualize this relationship using the Frequency/Recency matrix, which computes the expected number of transactions a artifical customer is to make in the next time period, given his or her recency (age at last purchase) and frequency (the number of repeat transactions he or she has made).
from lifetimes.plotting import plot_frequency_recency_matrix
plot_frequency_recency_matrix(bgf)
We can see that if a customer has bought 25 times from you, and their lastest purchase was when they were 35 weeks old (given the individual is 35 weeks old), then they are you best customer (bottom-right). You coldest customers are those that in the top-right corner: they bought a lot quickly, and we haven't seen them in weeks.
There's also that beautiful "tail" around (5,25). That represents the customer who buys infrequently, but we've seen him or her recently, so they might buy again - we're not sure if they are dead or just between purchases.
Let's return to our customers and rank them from "highest expected purchases in the next period" to lowest. Models expose a method that will predict a customer's expected purchases in the next period using their history.
t = 1
data['predicted_purchases'] = data.apply(lambda r: bgf.conditional_expected_number_of_purchases_up_to_time(t, r['x'], r['t_x'], r['T']), axis=1)
data.sort('predicted_purchases').tail(5)
"""
x t_x T predicted_purchases
ID
509 18 35.14 35.86 0.424877
841 19 34.00 34.14 0.474738
1981 17 28.43 28.86 0.486526
157 29 37.71 38.00 0.662396
1516 26 30.86 31.00 0.710623
"""
Great, we can see that the customer who has made 26 purchases, and bought very recently from us, is probably going to buy again in the next period.
Ok, we can predict and we can visualize our customers' behaviour, but is our model correct? There are a few ways to assess the model's correctness. The first is to compare your data versus artifical data simulated with your fitted model's parameters.
from lifetimes.plotting import plot_period_transactions
plot_period_transactions(bgf)
We can see that our actual data and our simulated data line up well. This proves that our model doesn't suck.
Most often, the dataset you have at hand will be at the transaction level. Lifetimes has some utility functions to transform that transactional data (one row per purchase) into summary data (a frequency, recency and age dataset).
from lifetimes.datasets import load_transaction_data
from lifetimes.utils import summary_data_from_transaction_data
transaction_data = load_transaction_data()
transaction_data.head()
"""
date id
0 2014-03-08 00:00:00 0
1 2014-05-21 00:00:00 1
2 2014-03-14 00:00:00 2
3 2014-04-09 00:00:00 2
4 2014-05-21 00:00:00 2
"""
summary = summary_data_from_transaction_data(transaction_data, 'id', 'date', observation_period_end='2014-12-31')
print summary.head()
"""
frequency recency T
id
0 0 0 298
1 0 0 224
2 6 142 292
3 0 0 147
4 2 9 183
"""
bgf.fit(summary['frequency'], summary['recency'], summary['T'])
# <lifetimes.BetaGeoFitter: fitted with 5000 customers, a: 1.85, alpha: 1.86, r: 0.16, b: 3.18>
With transactional data, we can partition the dataset into a calibration period dataset and a holdout dataset. This is important as we want to test how our model performs on data not yet seen (think cross-validation in standard machine learning literature). Lifetimes has a function to partition our dataset like this:
from lifetimes.utils import calibration_and_holdout_data
summary_cal_holdout = calibration_and_holdout_data(transaction_data, 'id', 'date',
calibration_period_end='2014-09-01',
observation_period_end='2014-12-31' )
print summary_cal_holdout.head()
"""
frequency_cal recency_cal T_cal frequency_holdout duration_holdout
id
0 0 0 177 0 121
1 0 0 103 0 121
2 6 142 171 0 121
3 0 0 26 0 121
4 2 9 62 0 121
"""
With this dataset, we can perform fitting on the _cal
columns, and test on the _holdout
columns:
from lifetimes.plotting import plot_calibration_purchases_vs_holdout_purchases
bgf.fit(summary_cal_holdout['frequency_cal'], summary_cal_holdout['recency_cal'], summary_cal_holdout['T_cal'])
plot_calibration_purchases_vs_holdout_purchases(bgf, summary_cal_holdout)
Basic on customer history, we can predict what an individuals future purchases might look like:
t = 10 #predict purchases in 10 periods
individual = summary.iloc[20]
# The below function may be renamed to `predict` in a future version of lifetimes
bgf.conditional_expected_number_of_purchases_up_to_time(t, individual['frequency'], individual['recency'], individual['T'])
# 0.0576511
Drop me a line at @cmrn_dp
!
The data analysis was done using demographica.
]]>%magic functions
.
Today, I've opened up my startup scripts in a github repo, StartupFiles. The repo comes with some helper scripts too, to get your started:
./bin/build_symlink
: for linking your startup directory to a local version of the repo. ./bin/ipython_analysis
: for going through your ipython history and looking for simple recommendations to add. (Example below)Here are some small videos about what you can do with the new startup scripts:
Just because I am always misspelling these!
Check out StartupFiles and let me know what you think below!
]]>Richard Dawkins, in his early book The Extended Phenotype, describes what he means when he says "statistically, X occurs". His original motivation was addressing a comment about gender, but it applies more generally:
If, then, it were true that the possession of a Y chromosome had a causal influence on, say, musical ability or fondness for knitting, what would this mean? It would mean that, in some specified population and in some specified environment, an observer in possession of information about an individual's sex would be able to make a statistically more accurate prediction as to the person's musical ability than an observer ignorant of the person's sex. The emphasis is on the word “statistically”, and let us throw in an “other things being equal” for good measure. The observer might be provided with some additional information, say on the person's education or upbringing, which would lead him to revise, or even reverse, his prediction based on sex. If females are statistically more likely than males to enjoy knitting, this does not mean that all females enjoy knitting, nor even that a majority do. [emphasis added]
I really enjoy his precise description of what statistics is. Ignore distributions, modelling, p-values and other statistical ideas for a moment: what statistics is really interested in is
For some event \(A\), what characteristic \(X\) allows \(P( A | X ) > P(A)\).
For example, in Dawkins case, \(A\) = person is a female, and \(X\) = person enjoys knitting. Thus \(P(\text{person is female}\; |\; \text{person enjoys knitting}) > P(\text{person is female})\).
Of course, in reality I don't have this much perfect information: I may have the ideal X, but not have enough data points to determine that \(X\) is indeed significant. Conversely, I may have lots of data, but not the correct covariate \(X\).
]]>
In the previous article in this series on Joins in MapReduce, we looked at how a traditional join is performed in a distributed map-reduce setting. I next want to generalize the idea of a join: typically we think of joining rows on equal keys, but more generally, you could join two rows on any function that returns a boolean: \( l, r \mapsto \theta(l,r) \). We call these generalized joins \(\theta\)-joins, or theta-joins. Because of the arbitrariness of the \(\theta\) function, theta-joins can be incredibly more computationally challenging. Here are some examples of theta-joins:
R.key < L.key
|R.key - L.key| < 2
R.key1 = L.key1 and R.key2 = L.key2
And then, implementing a theta-join in mapreduce is even more challenging! If you recall the previous join algorithm, where we implemented an equality-join hence \(\theta\) equaled the equality operator, we used an implicit trick: rows with the same key on the lefthand side and the righthand side would map to the same reducer in the groupByKey
step. Really, no joining is actually done pre-reducer, just organizing. So how do we implement a theta-join in mapreduce?
Unfortunately, the problem is not generally solved. For very specific join types, there might exist very fragile tricks that one can use, but they are not generalizable. In order to accommodate any join, you'll need to check all elements of the RHS (righthand side) against all elements of the LHS (lefthand side). This is inefficient or just generally crappy, so how can we reduce the crappiness? The authors of a pretty accessible paper, Processing Theta-Joins using MapReduce, present an efficient way to partition this cross-product (what they call the join-space) across the reducers. What's unique about this algorithm is because of it's randomness (yes, it's stochastic), it handles the skew-problem common in mapreduce joins gracefully.
Below is a PySpark implementation of the 1-Bucket-Theta join:
And some example usage using datetimes, this might be called an interval-join as I'm searching for rows that have a datetime that falls into a interval.
Sorry, but the above implementation is just too inefficient: that join-space I mentioned above, which is the cross-product of the datasets, is simply too big. Most rows will not be included in the result set, yet we check them anyways. This is the burden of the theta-join. Though I don't follow suite, the paper does examine gains by using some summary statistics from the LHS and RHS to improve the performance of certain types of theta-joins.
]]>
In traditional databases, the JOIN algorithm has been exhaustively optimized: it's likely the bottleneck for most queries. On the other hand, MapReduce, being so primitive, has a simpler implementation. Let's look at a standard join in MapReduce (with syntax from PySpark).
We start off with two datasets, that might resemble something like below
# left # right
(1, 'A', 'B' ) | (1, 'Z', 'Y' )
(2, 'C', 'D' ) | (1, 'X', 'V' )
(2, 'E', 'F' ) | (2, 'W', 'U' )
(3, 'E', 'F' ) | (4, 'T', 'S' )
We need to specify a key for each record, something that we will use to join on. In PySpark, there is a keyBy
function that allows you to set a key. Below is the resulting output.
left = left.keyBy(lambda r: r[0])
right = right.keyBy(lambda r: r[0])
# left # right
(1, (1, 'A', 'B') ) | (1, (1, 'Z', 'Y' ))
(2, (2, 'C', 'D') ) | (1, (1, 'X', 'V' ))
(2, (2, 'E', 'F') ) | (2, (2, 'W', 'U' ))
(3, (3, 'E', 'F') ) | (4, (4, 'T', 'S' ))
Cool. Next we will add some values to denote the origin of the record: 1 if it is from left, and 2 if it is from right.
left = left.map(lambda (k, v): (k, (1, v)))
right = right.map(lambda (k, v): (k, (2, v)))
# left # right
(1, (1, (1, 'A', 'B')) ) | (1, (2, (1, 'Z', 'Y')) )
(2, (1, (2, 'C', 'D')) ) | (1, (2, (1, 'X', 'V')) )
(2, (1, (2, 'E', 'F')) ) | (2, (2, (2, 'W', 'U')) )
(3, (1, (3, 'E', 'F')) ) | (4, (2, (4, 'T', 'S')) )
# We now have (key, (origin, values)) tuples
Next we add the datasets together, using the union
function.
unioned = left.union(right)
# unioned
(1, (1, (1, 'A', 'B')) ) # key = 1
(2, (1, (2, 'C', 'D')) ) # key = 2
(2, (1, (2, 'E', 'F')) ) # key = 2
(3, (1, (3, 'E', 'F')) ) # key = 3
(1, (2, (1, 'Z', 'Y')) ) # key = 1
(1, (2, (1, 'X', 'V')) ) # key = 1
(2, (2, (2, 'W', 'U')) ) # key = 2
(4, (2, (4, 'T', 'S')) ) # key = 4
Up to now, we've really only been mapping. Next we want to start reducing. We start by performing a groupByKey
on the dataset, which will send items with the same key to the same reducer. The group by key will also add up all records with the same key into a list, as we see below:
grouped = unioned.groupByKey() # grouped (each key goes to a different reducer) (1, [(1, (1, 'A', 'B')), (2, (1, 'Z', 'Y')), (2, (1, 'X', 'V'))]) (2, [(1, (2, 'C', 'D')), (1, (2, 'E', 'F')),
(2, (2, 'W', 'U'))])
(3, [(1, (3, 'E', 'F'))])
(4, [(2, (4, 'T', 'S'))])
We know have (key, [list of all elements with that key])
Oh my god we're almost done! Finally, we take each array and map a function that will perform a mini-cross product:
result = grouped.flatMapValues(lambda x : dispatch(x))
# Take each value (the list in the second position), and perform the dispatch function (below) on it.
# For example, let's look at the first element of grouped, i.e. key=1:
def dispatch(seq):
left_origin = []
right_origin = []
for (n, v) in seq:
if n == 1:
left_origin.append(v)
elif n == 2:
right_origin.append(v)
return [(v, w) for v in left_origin for w in right_origin]
For example:
> d = [(1, (1, 'A', 'B')), (2, (1, 'Z', 'Y')),(2, (1, 'X', 'V'))]
> dispatch(d)
[
((1, 'A', 'B'), (1, 'Z', 'Y')),
((1, 'A', 'B'), (1, 'X', 'V'))
]
Which is the correct result for joining left 1 keys against right 1 keys.
This is a neat way to perform a distributed map-reduce join. But there are limitations:
This latter problem is called a \(theta\)-join, where in the case of an equality join \(theta\) is the equality operator. We'll explore how to solve both these problems at once in the next blog post.
]]>I've enumerated only four, but there are many others. Below is a video explaining one reason why we see powerlaw distributions.
I really like this video because it gives a very intuitive reason why powerlaws can exist. It also cements that the winners become winners scheme will create powerlaws and extreme inequalities.
It's tempting to go in reverse: given you see a powerlaw, the distribution must have been generated by a winners become winners scheme. Unfortunately, this is not true: there are other schemes that can create powerlaws. But I would go as far as to say that most empirical powerlaws we see are generated by winners become winners. The trick is to find out why winners become winners.
The above gives a good generating scheme of how powerlaws arise for quantities (books sales, wealth, followers), but what about durations? There exist durations that are powerlawed: response times in server requests are powerlawed, ages of technologies are powerlawed. The generating scheme winners become winners doesn't carry over well to the time domain. Is there an analogy we can use?
Update: Françoise P., in the comments, offers an interesting solution. Packet arrival time in networks often follow a powerlaw. There is another case, similar to this. Consider a random walk over the integers, starting at 0 and with a 0.5 chance of moving up one, and a 0.5 chance of moving down one. The distribution of the duration between returning to 0 is powerlawed: the probability of a duration of length n is proportional to one over 2 to the power of n.
]]>If you have been following this blog, you'll know that I employ Bayesian A/B testing for conversion tests (and see this screencast to see how it works). One of the strongest reasons for this is the interpretability of the analogous "p-value", which I call Confidence, defined as the probability the conversion rate of A is greater than B,
$$\text{confidence} = P( C_A > C_B ) $$
Really, this is what the experimenter wants - an answer to: "what are the chances I am wrong?". So if I saw a confidence of 1.0 (or as I report it, 100%), I would have to conclude that I know for certain which group is superior.
In practice, I've encountered a 100% confidence a few times, most often in two situations: when my sample size is very low, and when my sample size is very high. The latter situation makes sense: I have lots of data, enough so my testing algorithm can confidently conclude a winner. The former, when I have a small sample size, makes less sense - at first. Suppose the difference between the groups is small, then it doesn't make sense for my algorithm to be so certain in a winner. Instead what is happening is a break down between theory and practice.
Let's go back to the beginning. What are the assumptions of A/B testing. Luckily, they are quite short:
1. Before any treatment, your groups are the same.
That's it! The A/B testing algorithm proceeds by applying a treatment to one of the groups and you measure the results in both groups. In practice however, our groups are not the same. The best we can do is have them "statistically" equal, that is to say, equal on average. In practice, we need to rely on the Law of Large Numbers to make our groups statistically equal. Unfortunately, the LLN works for large sample sizes only - you can't assume equal, or even similar, populations with small sample sizes.
With small sample sizes, and even with medium sample sizes, your groups will be different and there is nothing you can do about it. Perhaps just by chance, more people who were going to convert were put in Bucket A, and thus the treatment's effect is dwarfed by this chance event. This is the class imbalance problem. This is why I was seeing 100% confidence with small sample sizes: by chance a bunch of converters were put into a single bucket and my A/B test algorithm was assuming equal populations.
Let's make this more formal. Define delta, denoted \(\delta\) as the observed difference between the two groups. \(\delta\) can be broken down into two parts:
$$ \delta = \delta_{ci} + \delta_{T}$$
where \(\delta_{T}\) is the difference in conversion rates due to the treatment, and \(\delta_{ci}\) is the difference due to the class imbalance problem. I'm curious about the variance in the observed \(\delta\), that is, how extreme might my observed \(\delta\) be. Taking the variance of both sides:
$$ \text{Var}(\delta) = \text{Var}(\delta_{ci}) + \text{Var}(\delta_{T}) $$
(As the two terms on the right hand side are independent, there will be no correlation term).
Let's do an A/A test: that is, let's not apply a treatment to either group. Then \(\delta_{T}=0\). So any variance we see in the observed delta is due purely to variance from the class imbalance problem. Hence in an A/A test:
$$ \text{Var}(\delta) = \text{Var}(\delta_{ci})$$
To find the variance, we'll perform some Monte Carlo simulations of a typical A/A test. Here are the steps:
Below is the Python code for this simulation:
Below is a grid of the observed variances, varying \(p\) and \(N\):
White is good: it implies the LLN has taken over and there is little variance in the observed \(\delta\). This is where an A/B experiment works: observed difference in the groups are highly likely to be the result of the treatment. If this is hard to read due to the lack of contrast, below is the log-plot of the above:
It means you have to distrust A/B results will small sample sizes, at least if the variance of the class imbalance problem is of the same order of magnitude as the variance of the treatment effect.
Furthermore, this is another reason why we see these huge "unicorn" experiments (ugh, this is term coined by a recent article that claimed 1000% improvement from an experiment). Suppose there is a positive increase from treatment and that group was assigned significantly more converters just by chance (and the sample size is small enough to allow this), then your estimate of "lift" is going to be much larger than it actually is. Compound this with mistaking the binary problem with the continuous problem, and you've got the recipe for the 1000% improvement fallacy.
Hopefully, you've learned from my mistake: don't believe anything with small sample sizes, especially if your treatment effect is very small. You can use the above figures to determine the variance at different population sizes and conversion probabilities (how do you know the conversion probability? Well, you don't - best to estimate it with the total conversion rate: total converts / total population). If the variance at that point is too high, then likely your observed delta is being influenced strongly by the class imbalance problem and not the treatment effect. You can find a csv of the data in the gist here.
Thanks to Chris Harland and Putra Manggala for their time discussing these ideas.
]]>It's clear that humans are irrational, but how irrational are they? After some research into behavourial economics, I became very interested in Prospect Theory, a modern theory about human behaviours. Part of Prospect Theory is determining how our brains value different outcomes under uncertainty, that is, how expected values are mentally calculated. A very interesting part of Prospect theory is that it is not probabilities that are used in the calculation of expected value:
Here, the q's are not the probabilities of outcome z, but it is from another probability measure called decision weights that humans actually use to weigh outcomes. Using a change of measure, we can observe the relationship between the actual probabilities and the decision weights:
My interest is in this change of measure.
Suppose you have two choices:
Which would you prefer?
Well, under the real world probabilty measure, these two choices are equal: .99 * 101 = .01 * 10000. Thus a rational human would be indifferent to either option. But an actualhuman would have a preference: they would see one more valuable than the other. Thus:
rewritten:
and dividing:
What's left to do is determine the direction of the first inequality.
So I created combinations of probabilities and prizes, all with equal real-world expected value, and asked Mechanical Turkers to pick which one they preferred:
The original HIT data and the python scripts that generate are below, plus the data that I just now recieved back from MTurk. Each pair of lotteries received 10 turkers votes.
Note: The Turking cost me $88.40, if you'd like to give back over Gittip, that would be great =)
Note: I called the first choice Lottery A and the second choice Lottery B.
Below is a slightly inappropriate heatmap of the choices people made. If everyone was rational, and hence indifferent to the two choices, the probabilities should hover around 0.5. This is clearly not the case.
What else do we see here?
I recorded my steps to take the raw Turker data, and create the visualization above:
Why did I ask the Turkers to deeply imagine winning $50 dollars before answering the question? This was to offset a potential anchoring effect: if a Turkers first choice had prize $10 000, then any other prize would have looked pitiful, as the anchor had been set at $10 000. By having them imagine winning $50 (lower than any prize), then any prize they latter saw would appear better than this anchor.
Next steps? I'd like to try this again, with more control over the Turkers (have a more diverse set of Turkers on it).
Can I see the code and raw data?Sure, here it is on Github!
One of the first data science books, though it wasn't labelled that at the time, was the excellent book "Freakonomics" (2005). The authors were the first to publicise using data to solve large problems, or to use data to discover novel details about society. The one chapter that I still deeply remember is their chapter on using first names from the US Census data to understand how first names change over time.
For example, the first name "Sandra" has been a very popular name over the past century, while "Brittany" has a brief period of intense popularity during the 80s and early 90s before fading out. What are the results of this difference? If we extrapolate the year of birth until now, then most "Brittany"s now are between the ages of 15-29. In fact, you can be almost certain they are between these ages. On the other, "Sandra" has had a more sustainable popularity, but the name peaked in 1947 and has declined since. Below we plot the current ages of everyone in the US with first names "Brittany" or "Sandra"
Freakonomics was the first book to bring this really cool idea, using first names to determine ages, to light. I was so inspired that I rushed out and gathered all the census data so I could play around with this myself. This inspiration finalized in the Demographica library (Latin for "population-drawing"), an easy way to query this data and answer questions about first names and age. Ex:
To do this, we can just look at the most common names in the 85 years and over bucket:
And what are the most common boy names these days? Similar to above: Jacob, Mason, Noah, Ethan, William, Michael and Jayden. What about your age? What are the most common names of people in your age bucket?
First of all, I was surprised this worked. I was curious about finding all the names that were really hot right now, or about to be really hot. For example, "Channing" is really hot right now:
I tried to think of some statistical ways to do this: compute the rate-of-change of the graph and determine some statistical thresholds, or create some complicated if-statements, etc. I didn't feel these were inspirational ideas. The idea of k-means crossed my mind, but I don't like using k-means on time series data - it just feels wrong. But, I did it anyways. Using some Pandas magic, I normalized each Male name to a probability distribution over the age buckets, and removed the bottom 90% of all names: this resulted in a list of 1300 popular names and their age distributions. (Why remove the bottom names? Name popularity are power-lawed - most names don't have enough data to form an accurate probability distribution.)
Using scikit-learn and its K-Means implementation, I clustered the distributions into 12 clusters (arbitrarily picked 12). Here are the resulting centers of the clusters:
There's some pretty neat stuff happening here:
You may laugh at that last list, until your own daughter brings home a Bently to meet the family
Here's my raw IPython code if you want to reproduce the above analysis (I used %hist in my terminal to generate this)
Me too, so I did the same analysis for Female names, below:
My friend who I was hacking on this with asked "Find some names that were massively popular for a very short period of time". To do this, I increased the value of k to 35, and plotted the cluster centers (Why? I knew a name like that would have a very spiked distribution, and if I plotted it it would stand out among the others - but I needed more granularity in the clusters to find it. So I increased k to 35): /p>
Bingo. That peaked distribution contain the names we want. Only two names were in that list: Catina and Katina. Both were massively popular about 40 years ago. A quick google search on Katina (the more popular to the two) determined the name was used for a newborn baby on the soap opera 'Where the Heart Is' in 1972.
Demographica is good for looking at age distributions of certain names but has another cool feature: population age inference. Let's say you run a small ecommerce website, and each order has an associated name with it. You are probably curious about the age distribution of your customers. The next steps of Demographica will enable the user to infer age distributions of their user. How do to this?
That's the idea, at least. I'll report back with how it goes!