As I was developing lifelines, I kept having a feeling that I was gradually moving the library towards prediction tasks. lifelines is great for regression models and fitting survival distributions, but as I was adding more and more flexible parametric models, I realized that I really wanted a model that would predict the survival function — and I didn't care how. This led me to the idea to use a neural net with \(n\) outputs, one output for each parameter in the cumulative hazard, \(H(t)\). lifelines has implemented the linear version of this, where each parameter in the cumulative hazard is a linear function of the variables:
Linear:
$$ H(t\;|\;x) = f(a, b, c) \\a = a(x \cdot \beta_1^T) \\b = b(x \cdot \beta_2^T) \\c = c(x \cdot \mathbf{\beta}_3^T) $$
Neural Net (NN):
$$ H(t\;|\;x) = f(a, b, c) \\a = \text{NN}(x)_1 \\b = \text{NN}(x)_2 \\c = \text{NN}(x)_3 $$
If my NN has no hidden layer, it's equivalent to the linear case. Great, so using a NN and a function \(f\) provides us a good way to predict survival functions. Can we do better? Well, we are stuck with a particular \(f\), which may or may not be flexible enough to capture all the nuances of possible survival. For example, SaaS churn has very unique survival behavior that a traditional parametric model couldn't capture well.
Thinking more about that blog post linked above, I realized that I could use a piecewise approach and split my timescale into partitions as fine as I want, and then predict a constant hazard in each interval. Graphically, something like this:
Each output predicts the hazard in a time interval. This piecewise constant hazard can form the cumulative hazard, and then the survival function.
As the NN increases in depth and complexity, each output benefits from the information in the other time periods. For example, if the model thinks that a certain generated feature is really important, it's exposed for all outputs to use.
What about those taus — which I call breakpoints — how are they determined? There are a few options. The first option is the manually specify them. This is okay if we have apriori knowledge of important times, but often we don't. A second option is to expand the neural net to return breakpoints as well. Unfortunately I haven't gotten this model to converge well. The final option considered is to use the dataset to choose breakpoints. In the Kaplan Meier model, the time points of change in the survival function are equal to the unique event times. More generally, the density of the changes in survival is proportional to the density of observed event times. Using this, we can programmatically choose good breakpoints by first creating a density estimate of observed event times, and spacing out breakpoints in proportion to high density. That is, we solve \(p=F_T(t)\) for \(t\) at evenly spaced \(p\) values. Below is an example CDF and how breakpoints are determined:
The number of breakpoints is more arbitrary, and can be manually set or heuristically set (like proportional to the square root the number of observed event times).
So, to recap: we choose breakpoints in proportion to density of event times (which is where the most information is), and use a neural net to determine a subject's hazard in that interval. This is exactly what my new Python library, lifelike, implements. At the moment, I am building upon the computational library Jax. lifelike's API is similar to Keras, and users familiar with Keras (and Jax) could jump in immediately. The library is also quite opinionated, and based on my own philosophy on survival analysis. For example:
1. The predicted survival function is returned, and no effort is made to compute a point estimate. This aligns with my belief that a distribution has enormous advantages (and on the other side of that coin: point estimates are overrated). Users can compute their own points estimates if they wish
2. Fitting is done against the log-likelihood loss. The survival analysis log-likelihood contains all the information around censoring and observed event times. Is it interpretable? Not really, but most losses are not anyways. The goal is to minimize loss, and I believe the log-likelihood is a terrific loss.
Let's see some inference done with lifelike. The model was trained on an artificial SaaS churn dataset. No prior knowledge about the breakpoints were provided. Below we see the predicted survival function and predicted hazard of a subject.
The model was correctly able to find the periods of high SaaS customer churn, and predict the subject's survival function.
This NN piecewise-constant model isn't new, and has been published recently in literature before. However, almost all the literature has a very shallow network, and is fit using BFGS. lifelike is the first implementation in a modern computational graph framework, and leaning on Jax means we get autodiff, JIT and GPUs for free. We can scale this model is any dataset size or network we wish (limited only by Jax). For example, my upcoming research could be using images from spectrometers as an input to NNs — lifelike theoretically can handle that.
I hope this blog article inspired you or gave you some new ideas. lifelike is very experimental, and I'm still iterating on APIs and functionality, so use at your own risk. Stay tuned for more updates!
]]>I’ve started growing yeast in my closet-turned-laboratory. There’s a reason why I am growing yeast, but that’ll be for another post. For this experiment, I wanted to use my new hemocytometer to do cell counts periodically over the next few days to gather data.
A nutrient-rich bioreactor (an Erlenmeyer flask with wort) was left at room temperature with plenty of aeration (a magnetic stirrer) for about 2.5 days. My collected data is below.
]]>A nutrient-rich bioreactor (an Erlenmeyer flask with wort) was left at room temperature with plenty of aeration (a magnetic stirrer) for about 2.5 days. My collected data is below.
| hour | cell count |
|------|------------|
| 0 | 20 |
| 12.5 | 21 |
| 17.5 | 28 |
| 23 | 34 |
| 36.5 | 34 |
| 42.5 | 31 |
| 48 | 32 |
| 65 | 32 |
As I mentioned in my previous post, counting cells is a very noisy process, so we want to keep that in mind as we analysis this data. We have two options to proceed:
I like option two, as it’s more of a challenge, and it can be used for interpolation within the data points and extrapolation outside of the observed data points.
Typically microorganism growth after inoculation has 3 phases: lag-phase, log-phase and stationary phase. The lag-phase is a period where the organisms become accustomed to their new environment, and have low reproduction. The log-phase is a poorly-named phase that represents the period of high (exponential) reproduction. Finally, after the medium has been depleted or the organism concentration has become too high, the organisms stop reproducing and they enter the stationary phase. This type of growth looks a lot like logistic growth, so let’s use that model:
$$(\text{yeast/mL})_t = P_0 + \frac{K}{1 + \exp(-r\cdot(t - \delta))}$$
The lag-phase is modeled by the \(\delta\) parameter - the larger this is, the longer the lag-phase went on for. From sources, the lag-phase usually lasts less than 24h. Given my initial bioreactor conditions, a lookup table suggests I should see about 50% growth. Furthermore, using the volume of the medium and the estimated concentration of my inoculate, I have an estimate for my initial concentration. All these give me priors for my estimates.
We can model this in PyMC3 like so (this is modified code from my previous yeast-counting blog article).
import pymc3 as pm
MILLION = 1e6
TOTAL_SQUARES = 25
SQUARES_COUNTED = 4
yeast_counted = np.array([20, 21, 28, 34, 34, 31, 32, 32])
hours_since_inoc = np.array([0, 12.5, 17.5, 23, 36.5, 42.5, 48, 65])
n_obs = yeast_counted.shape[0]
def logistic(t, K, r, delta_t):
return K / (1 + np.exp(-r * (t - delta_t)))
with pm.Model() as model:
K = pm.Normal("K", mu=50 * MILLION, sd=25 * MILLION) # about 50% growth was expected
P0 = pm.Normal("P0", mu=100 * MILLION, sd=25 * MILLION)
r = pm.Exponential("r", lam=2.5)
delta_t = pm.Uniform("delta_t", lower=0, upper=24) # lag phase stops in the first 24 hours
yeast_conc = P0 + logistic(hours_since_inoc, K, r, delta_t)
shaker1_volume = pm.Normal("shaker1 volume (mL)", mu=9.0, sigma=0.05, shape=n_obs)
shaker2_volume = pm.Normal("shaker2 volume (mL)", mu=9.0, sigma=0.05, shape=n_obs)
yeast_slurry_volume = pm.Normal("initial yeast slurry volume (mL)", mu=1.0, sigma=0.01, shape=n_obs)
shaker1_to_shaker2_volume = pm.Normal("shaker1 to shaker2 (mL)", mu=1.0, sigma=0.01, shape=n_obs)
dilution_shaker1 = pm.Deterministic("dilution_shaker1", yeast_slurry_volume / (yeast_slurry_volume + shaker1_volume))
final_dilution_factor = pm.Deterministic("dilution_shaker2", dilution_shaker1 * shaker1_to_shaker2_volume / (shaker1_to_shaker2_volume + shaker2_volume))
volume_of_chamber = pm.Gamma("volume of chamber (mL)", mu=0.0001, sd=0.0001 / 20)
# why is Poisson justified? in my final shaker, I have yeast_conc * final_dilution_factor * shaker2_volume number of yeast
# I remove volume_of_chamber / shaker2_volume fraction of them, hence it's a binomial with very high count, and very low probability.
yeast_visible = pm.Poisson("cells in visible portion", mu=yeast_conc * final_dilution_factor * volume_of_chamber, shape=n_obs)
number_of_counted_cells = pm.Binomial("number of counted cells", yeast_visible, SQUARES_COUNTED/TOTAL_SQUARES, observed=yeast_counted, shape=n_obs)
trace = pm.sample(2000, tune=20000)
We are mostly interested in our posteriors for the parameters of the growth model.
Actually, that’s not true: we aren’t really interested in the parameters. Most don’t have an easy interpretation (except for delta_t
). What we are really interested in is the posterior of the growth curve. Recall this is a distribution. To demonstrate this, we can sample from the parameters’ posteriors and drop those values into the growth curve. For example, if we sampled 4 times:
Each of these curves look very different, which should give us pause when we make inference about our growth. What happens if we keep sampling growth curves, and then average over all of them - what does that curve look like? What about the error bars on that? This is easy to do (and part of the reason I appreciate Bayesian computation). On the same graph, I’m also going to plot hundreds of potential realizations, as it’s important not to get too focused on the mean as being the “truth” - there is still lots of variation!
Lovely. We can see an estimate for our initial concentration was about 122M ± 18M yeast/mL, and our final concentration was about 191M ± 15M yeast/mL. Eyeballing, the lag phase stopped just prior to 8h (though we need to reconcile this with the posterior mean of the delta_t
is 13.5).
What was our estimated yield? I mentioned previously that 50% is expected - how did we do? This computation is also easy in Bayesian inference, we just look at the distribution of the ratio of yeast/mL at time 60 over yeast/mL at time 0. Doing this, the average value of this distribution is 59%.
This was a fun little experiment to mix statistics and biology. Further extensions include adding covariates, and modeling death of cells too (there is a finite amount of energy in the medium, so this should happen eventually). For now, however, I’m washing the yeast and eventually going to dehydrate them.
]]>The first step is to dilute your sample. The original concentration could be as high as billions per mL, so even a fraction of a mL is going to be still too concentrated to properly count. This is solved by successive dilutions of the original sample:
Each time we transfer to a new test tube, we dilute the sample by a factor of 10.
After this is done, a very small amount of the final diluted slurry is added to a hemocytometer. A hecocytometer is a glass slide with a visible grid that an observer can place under a microscope and count cells on. Under the microscope, it looks like this (where yellow dots represent cells):
Source: https://di.uq.edu.au/community-and-alumni/sparq-ed/sparq-ed-services/using-haemocytometer
The hemocytometer is designed such that a known quantity of volume exists under the inner 25 squares (usually 0.0001 mL). The observer will apriori pick 5 squares to count cells in, typically the 4 corner squares and the middle square. (Why only 5? Unlike this image above, there could be thousands of cells, and counting all 25 squares would take too long). Since we know the volume of the 25 squares, and the dilution rate, we can recover our original slurry concentration:
$$ \text{cells/mL} = (\text{cells counted}) \cdot 5 \cdot (\text{dilution factor}) / 0.0001 \text{mL} $$Given I counted 49 yeast and my dilution was 1000x, my estimate is 2.45B yeast/mL. Great! We are done, right? No way, consider all the sources of uncertainty we glossed over:
So we should expect high variance in our estimate because of the many sources of noise, and because they are layered on top of one another. Let’s redo this with Bayesian statistics so we can model our uncertainty.
Here’s the code for an observation of 49 yeast cells counted. Each source of noise is a random variable. I’ve added some priors that I think are sensible.
import pymc3 as pm
BILLION = 1e9
TOTAL_SQUARES = 25
squares_counted = 5
yeast_counted = 49
with pm.Model() as model:
yeast_conc = pm.Normal("cells/mL", mu=2 * BILLION, sd=0.4 * BILLION)
shaker1_volume = pm.Normal("shaker1 volume (mL)", mu=9.0, sd=0.05)
shaker2_volume = pm.Normal("shaker2 volume (mL)", mu=9.0, sd=0.05)
shaker3_volume = pm.Normal("shaker3 volume (mL)", mu=9.0, sd=0.05)
yeast_slurry_volume = pm.Normal("initial yeast slurry volume (mL)", mu=1.0, sd=0.01)
shaker1_to_shaker2_volume = pm.Normal("shaker1 to shaker2 (mL)", mu=1.0, sd=0.01)
shaker2_to_shaker3_volume = pm.Normal("shaker2 to shaker3 (mL)", mu=1.0, sd=0.01)
dilution_shaker1 = yeast_slurry_volume / (yeast_slurry_volume + shaker1_volume)
dilution_shaker2 = shaker1_to_shaker2_volume / (shaker1_to_shaker2_volume + shaker2_volume)
dilution_shaker3 = shaker2_to_shaker3_volume / (shaker2_to_shaker3_volume + shaker3_volume)
final_dilution_factor = dilution_shaker1 * dilution_shaker2 * dilution_shaker3
volume_of_chamber = pm.Gamma("volume of chamber (mL)", mu=0.0001, sd=0.0001 / 20)
# why is Poisson justified? in my final shaker, I have yeast_conc * final_dilution_factor * shaker3_volume number of yeast
# I remove volume_of_chamber / shaker3_volume fraction of them, hence it's a binomial with very high count, and very low probability.
yeast_visible = pm.Poisson("cells in visible portion", mu=yeast_conc * final_dilution_factor * volume_of_chamber)
number_of_counted_cells = pm.Binomial("number of counted cells", yeast_visible, squares_counted/TOTAL_SQUARES, observed=yeast_counted)
trace = pm.sample(5000, tune=1000)
pm.plot_posterior(trace, varnames=['cells/mL'])
The posterior for cells/mL is below:
We can see that the width of the credible interval is about a billion. That’s surprisingly large. And that makes sense: we only have a single, noisy observation. We can see the influence of the prior here as well. Note that the posterior’s mean is about 2.25B, much closer to the priors’ mean of 2B than our naive estimate of 2.45B above. This brings up another point: using the naive formula above is like saying: “I observed a single coin flip, saw heads, so all coin flips are heads.” Sounds silly, but it’s identical inference. With Bayesian statistics, we get to lie in a much more reassuring bed!
Read the next article in this series, modelling growth.
]]>kmf = KaplanMeierFitter().fit(df['T'], df['E'])
kmf.plot(figsize=(11,6));
To borrow a term from finance, we clearly have different regimes that a customer goes through: periods of low churn and periods of high churn, both of which are predictable. This predictability and "sharp" changes in hazards suggests that a piecewise hazard model may work well: hazard is constant during intervals, but varies over different intervals.
Furthermore, we can imagine that individual customer variables influence their likelihood to churn as well. Since we have baseline information, we can fit a regression model. For simplicity, let's assume that a customer's hazard is constant in each period, however it varies over each customer (heterogeneity in customers).
Our hazard model looks like¹: $$ h(t\;|\;x) = \begin{cases} \lambda_0(x)^{-1}, & t \le \tau_0 \\ \lambda_1(x)^{-1} & \tau_0 < t \le \tau_1 \\ \lambda_2(x)^{-1} & \tau_1 < t \le \tau_2 \\ ... \end{cases} $$
and \(\lambda_i(x) = \exp(\mathbf{\beta}_i x^T), \;\; \mathbf{\beta}_i = (\beta_{i,1}, \beta_{i,2}, ...)\). That is, each period has a hazard rate, \(\lambda_i\), that is the exponential of a linear model. The parameters of each linear model are unique to that period - different periods have different parameters (later we will generalize this).
Why do I want a model like this? Well, it offers lots of flexibility (at the cost of efficiency though), but importantly I can see:
One interesting point is that this model is not an accelerated failure time model even though the behaviour in each interval looks like one. This is because the breakpoints (intervals) do not change in response (contract or dilate) to the covariates (but that's an interesting extension).
¹ I specify the reciprocal because that follows lifelines convention for exponential and Weibull hazards. In practice, it means the interpretation of the sign is possibly different.
pew = PiecewiseExponentialRegressionFitter(
breakpoints=breakpoints)\
.fit(df, "T", "E")
Above we fit the regression model. We supplied a list of breakpoints that we inferred from the survival function and from our domain knowledge.
Let's first look at the average hazard in each interval, over time. We should see that during periods of high customer churn, we also have a high hazard. We should also see that the hazard is constant in each interval.
fig, ax = plt.subplots(1,1)
kmf.plot(figsize=(11,6), ax=ax);
ax.legend(loc="upper left")
ax.set_ylabel("Survival")
ax2 = ax.twinx()
pew.predict_cumulative_hazard(
pew._norm_mean.to_frame(name='average hazard').T,
times=np.arange(0, 110),
).diff().plot(ax=ax2, c='k', alpha=0.80)
ax2.legend(loc="upper right")
ax2.set_ylabel("Hazard")
It's obvious that the highest average churn is in the first few days, and then high again in the latter billing periods.
So far, we have only been looking at the aggregated population - that is, we haven't looked at what variables are associated with churning. Let's first start with investigating what is causing (or associated with) the drop at the second billing event (~day 30).
fig, ax = plt.subplots(figsize=(10, 4))
pew.plot(parameter=['lambda_2_'], ax=ax);
From this forest plot, we can see that the var1
has a protective effect, that is, customers with a high var1
are much less likely to churn in the second billing periods. var2
has little effect, but possibly negative. From a business point of view, maximizing var1
for customers would be a good move (assuming it's a causal relationship).
We can look at all the coefficients in one large forest plot, see below. We see a distinct alternating pattern in the _intercepts
variable. This makes sense, as our hazard rate shifts between high and low churn regimes. The influence of var1
seems to spike in the 3rd interval (lambda_2_
), and then decays back to zero. The influence of var2
looks like it starts to become more negative over time, that is, is associated with more churn over time.
fig, ax = plt.subplots(figsize=(10, 10))
pew.plot(ax=ax);
If we suspect there is some parameter sharing between intervals, or we want to regularize (and hence share information) between intervals, we can include a penalizer which penalizes the variance of the estimates per covariate.
Note: we do not penalize the intercept, currently. This is a modelers decision, but I think it's better not too.
Specifically, our penalized log-likelihood, \(PLL\), looks like:
$$ PLL = LL - \alpha \sum_j \hat{\sigma}_j^2 $$where \(\hat{\sigma}_j\) is the standard deviation of \(\beta_{i, j}\) over all periods \(i\). This acts as a regularizer and much like a multilevel component in Bayesian statistics. In the above inference, we implicitly set \(\alpha\) equal to 0. Below we examine some more cases of varying \(\alpha\). First we set \(\alpha\) to an extremely large value, which should push the variances of the estimates to zero.
# Extreme case, note that all the covariates' parameters are almost identical.
pew = PiecewiseExponentialRegressionFitter(
breakpoints=breakpoints,
penalizer=20.0)\
.fit(df, "T", "E")
fig, ax = plt.subplots(figsize=(10, 10))
pew.plot(ax=ax);
As we suspected, a very high penalizer will constrain the same parameter between intervals to be equal (and hence 0 variance). This is the same as the model:
$$ h(t\;|\;x) = \begin{cases} \lambda_0(x)^{-1}, & t \le \tau_0 \\ \lambda_1(x)^{-1} & \tau_0 < t \le \tau_1 \\ \lambda_2(x)^{-1} & \tau_1 < t \le \tau_2 \\ ... \end{cases} $$and \(\lambda_i(x) = \exp(\mathbf{\beta_{0,i} + \beta} x^T), \;\; \mathbf{\beta} = (\beta_{1}, \beta_{2}, ...)\). Note the reuse of the \(\beta\)s between intervals.
This model is the same model proposed in "Piecewise Exponential Models for Survival Data with Covariates".
One nice property of this model is that because of the extreme information sharing between intervals, we have maximum information for inferences, and hence small standard errors per parameter. However, if the parameters effect is truly time-varying (and not constant), then the standard error will be inflated and a less constrained model is better.
Below we examine a in-between penalty, and compare it to the zero penalty.
# less extreme case
pew = PiecewiseExponentialRegressionFitter(
breakpoints=breakpoints,
penalizer=.25)\
.fit(df, "T", "E")
fig, ax = plt.subplots(figsize=(10, 10))
pew.plot(ax=ax, fmt="s", label="small penalty on variance")
# compare this to the no penalizer case
pew_no_penalty = PiecewiseExponentialRegressionFitter(
breakpoints=breakpoints,
penalizer=0)\
.fit(df, "T", "E")
pew_no_penalty.plot(ax=ax, c="r", fmt="o", label="no penalty on variance")
plt.legend();
We can see that:
I think, in practice, adding a small penalty is the right thing to do. It's extremely unlikely that intervals are independent, and extremely unlikely that parameters are constant over intervals.
Like all lifelines models, we have prediction methods too. This is where we can see customer heterogeneity vividly.
# Some prediction methods
pew.predict_survival_function(df.loc[0:3]).plot(figsize=(10, 5));
pew.predict_cumulative_hazard(df.loc[0:3]).plot(figsize=(10, 5));
pew.predict_median(df.loc[0:5])
In conclusion, this model is pretty flexible and it is one that can encourage more questions to be asked. Beyond just SaaS churn, one can think of other application of piecewise regression models: employee churn after their stock option vesting cliff, mortality during different life stages, or modelling time-varying parameters.
Future extensions include adding support for time-varying covariates. Stay tuned!
T = [0, 2, 4, 7 ] # in hours
N = [1000, 914, 568, 112]
I’ll present two so-so solutions to finding the half-life, and then one much better solution.
We can plot this over time:
We can eyeball the half-life to be about 4.3h - note that this is a linear interpretation between two points, and thus this method doesn’t consider all the data (i.e. it’s not a global method).
Exponential death is a common model in ecology because of it’s simplicity. We can try to find a \(\beta\) value such that the sum of squares of observed values minus \(1000 \exp(\beta t) \) is minimized. That’s not hard to do with scipy’s minimize functions. The best estimate for \(\beta\) is about 0.17. But, as we can see, it’s not a very good fit:
We could choose a more flexible model (like a Weibull distribution), but we will still run up against a fundamental problem: the variance on the estimates will be very high because we are only considering 4 data points, and adding more parameters to the fitting model will only make things worse.
Think again about how the data was collected for a moment. We took a count, waited a few hours, and then took a count again. The delta in the population are composed of individuals that died sometime in that interval. So what we have is a case of interval censoring. That is, I know that 86 individuals died sometime between 0 and 2 (lower and upper bound), 346 died sometime between 2 and 4 (lower and upper bound), etc. Finally, we are left with 112 that are right censored, that is, we stopped watching and don’t observe their death.
We can use survival analysis to answer this problem (hey, let’s use lifelines!). Importantly, we can treat all 1000 organisms as individual observations to make the inference stronger. First, we’ll do some data manipulation.
df = pd.DataFrame({
"deltas": [86, 346, 456, 112],
"start": [0, 2, 4, 7],
"stop": [2, 4 , 7, np.inf],
"observed_death": [False, False, False, False]
})
print(df)
"""
deltas start stop observed_death
0 86 0 2.0 False
1 346 2 4.0 False
2 456 4 7.0 False
3 112 7 inf False
"""
We have observed_death
as always False
, because we don’t have exact measurements on any of the deaths (that is, I have no organisms where I can say “this guy lived exactly X hours”). Why do I have the infinity in the last row? Well, we don’t observe the final 112 organisms die either, but we know they will die between hour 7 and infinity, so we code that in too.
from lifelines import WeibullFitter
wf = WeibullFitter()
wf.fit_interval_censoring(
df['start'],
df['stop'],
df['observed_death'],
weights=df['deltas']
)
wf.print_summary()
"""
<lifelines.WeibullFitter: fitted with 4 observations, 4 censored>
number of subjects = 4
number of events = 0
log-likelihood = -1182.357
hypothesis = lambda_ != 1, rho_ != 1
---
coef se(coef) lower 0.95 upper 0.95 p -log2(p)
lambda_ 5.09 0.07 4.94 5.23 <0.005 inf
rho_ 2.49 0.08 2.35 2.64 <0.005 282.95
"""
Note the very small standard errors. And if we plot the resulting survival curve, we see an excellent fit.
From this, we can report the half-life, or even better is to present the survival distribution, since it has the most information in it.
Let's say we actually have two cultures we are working with, and take population measurements at the same time. However we have different environments for each of the cultures: one has double the sugar concentration in the medium than the other. We can model this using a regression survival model:
culture_1 = pd.DataFrame({
"deltas": [86, 346, 456, 112],
"start": [0, 2, 4, 7],
"stop": [2, 4 , 7, np.inf],
"conc_sugars": [0.15, 0.15, 0.15, 0.15],
"observed_death": [False, False, False, False]
})
culture_2 = pd.DataFrame({
"deltas": [50, 202, 566, 182],
"start": [0, 2, 4, 7],
"stop": [2, 4 , 7, np.inf],
"conc_sugars": [0.30, 0.30, 0.30, 0.30],
"observed_death": [False, False, False, False]
})
df = pd.concat([culture_2, culture_1])
from lifelines import WeibullAFTFitter
w_aft = WeibullAFTFitter().fit_interval_censoring(df, lower_bound_col='start',
upper_bound_col='stop',
event_col='observed_death',
weights_col='deltas')
w_aft.params_
"""
lambda_ conc_sugars 0.901763
_intercept 1.499147
rho_ _intercept 0.996401
"""
So what is the impact of sugar on microorganisms death? The coefficient associated with sugar concentration in this AFT model is interpreted as "higher means live longer". We get a coefficient of 0.90. Thus organisms with 0.15 sugar concentration "experience" time at a 14.4% faster rate than organisms with 0.30 sugar concentration (Why? \(\exp(0.90 (0.30-0.15))=1.144\) ).
This is a simple case of when you want to use interval censoring instead of more naive ways. By thinking about the data generating process, we gain lots of statistical efficiency and better predictions.
]]>variance_matrix_
property). From this, I can plot the survival curve, which is fine, but what I really want is a measure of lifetime value.
Suppose customers give us $10 each time period for the first 3 time periods, and then $30 each time period afterwards. For a single user, the average LTV (up to timeline
) calculation might look like:
# create a Weibull model with fake data
wf = WeibullFitter().fit(np.arange(1, 100))
from autograd import numpy as np
def LTV(params, timeline):
lambda_, rho_ = params
sf = np.exp(-(timeline/lambda_) ** rho_)
clv = 10 * (timeline <= 3) * sf + 30 * (timeline > 3) * sf
return clv.sum()
timeline = np.arange(10)
params = np.array([wf.lambda_, wf.rho_])
LTV(params, timeline)
# 214.76
This gives me a point estimate. But I also want the variance, since there is uncertainty in lambda-hat and rho-hat. There is a technique called the delta-method that allows me to transform variance from one domain to another. But to use that, I need the gradient of LTV w.r.t to lambda and rho. For this problem, I could probably write out the gradient, but it would be messy and error-prone. I can use autograd instead to do this.
from autograd import grad
gradient_LTV = grad(LTV)
gradient_LTV(params, timeline)
# array([ 0.15527043, 10.78900217])
The output of the above is the gradient at the params
value. Very easy. Now, if I want the variance of the LTV, the delta methods tells me to multiply gradient by the covariance matrix and the gradient’s transpose (check: this should return a scalar and not a vector/matrix):
var = gradient_LTV(params, timeline)\
.dot(wf.variance_matrix_)\
.dot(gradient_LTV(params, timeline))
# 3.127
So, my best estimate of LTV (at timeline=10) is 214.76 (3.127). From this, we can build confidence intervals, etc.
]]>TLDR: upgrade lifelines for lots of improvements
pip install -U lifelines
During my time off, I’ve spent a lot of time improving my side projects so I’m at least kinda proud of them. I think lifelines, my survival analysis library, is in that spot. I’m actually kinda proud of it now.
A lot has changed in lifelines in the past few months, and in this post I want to mention some of the biggest additions and the stories behind them.
The Cox proportional hazard model is the workhorse of survival analysis. Almost all papers, unless good reason not to, use the Cox model. This was one of the first regression models added to lifelines, but it has always been too slow. It’s implemented in Numpy, but there was a tricky for
loop still in Python. I had ideas on how to turn that loop into a vectorized Numpy operation of matrix products, but there would have been an intermediate variable that created a d x d x p
tensor, where d
is the number of independent covariates and p
is the size of some subset of subjects (at worst, p could be equal to the number of subjects). This will quickly explode the amount of memory required and hence performance would degrade.
One night, on Twitter, I noticed some posts about how to use einsum
in PyTorch and Numpy. I had previously heard of it, but didn’t think it was something that was implemented in Numpy nor did I think it was something I could use. Turns out, einsum
is a way to do matrix multiplication without intermediate variables! How? Since a product of matrices is a just multiplications and sums, one can declaratively define what the end product should look like and the internals of einsum
will compute the intermediate operations at the C layer (I’m simplifying and my own understanding of this is shaky). Suffice to say, after racking my brain and lots of trial and error with einsum
, I could replace the tricky Python for
loop with einsum
! This resulted in a 3x performance increase. However, what I've gained in performance, I've lost in some readability of my code. I’ve put some references to einsum
in the bottom of this article.
The second significant improvement to the performance of the Cox model is using a meta-algorithm to select the fastest algorithm. There are two ways to compute the likelihood, and the performance of each is highly dependent on a few characteristics of the dataset, notably how many ties are in the dataset. After running some tests, I noticed that the delineation of when one was faster than the other was not clear, so a heuristic like if num_ties > 10:
would not be very effective. What I did instead was to generate hundreds of artificial dataset of varying ties and varying size, and measured the timing of both algorithms. I then fit a linear model to the ratio of the times, conditioned on the ties and size (and their interaction). It was a surprisingly good fit! So now in lifelines, at runtime, I compute some statistics about the incoming dataset, plug these values into a fitted linear model, and the result is the predicted ratio of timings between the two algorithms. I then choose which algorithm would be faster and continue on. The prediction is super fast (it’s a linear model after all), so there are no performance hits there. With this meta-algorithm, the lifelines Cox implementation is up to 3x faster for some datasets. I wrote up a full summary of the idea in a previous blog post [2].
Overall, the Cox model is now 10x faster than it was a few months ago. (Also, I only mention it here, but the Aalen Additive model is like 50x times faster, but most of those speed improvements where replacing Pandas with NumPy in critical points.)
One large gap in lifelines was checking the proportional hazards assumption, which is critical for any kind of inference-focused modeling (it matters less for prediction tasks). The author of the popular R survival library, Terry Therneau, has made massive contributions to survival analysis techniques, including a statistical test for non-proportionality. This test relies on an important residual of the Cox regression. While implementing this statistical test in lifelines, I realized there was a more general solution for handle all residuals, so I added functionality to compute the most common residuals.
These additions enabled a new, very user friendly function, check_assumptions
, which prints out potential proportionality violations in human readable format and offers advice on how to fix it. I also introduced residuals plots:
I think too many people are focused on deep learning. There, I said it, and probably you agree. However, some really cool technologies are falling out of that area that others can use. One of them is libraries that implement automatic differentiation, aka autodiff. This is like the holy grail for computational statisticians. Let me quickly explain why: given an arbitrary numerical function, you can automatically compute its exact gradient at any (valid) point. No rounding errors. No restrictions on the function. Just gradients. To quote directly from [1]:
/beginquote
Q: What’s the difference between autodiff and symbolic diff?
R: They are totally different. The biggest difference is that autodiff can differentiate algorithms, not just expressions. Consider the following code:
function f(x)
y = x;
for i=1…100
y = sin(x+y);
return y
Automatic differentiation can differentiate that, easily, in the same time as the original code. Symbolic differentiation would lead to a huge expression that would take much more time to compute.
Q: What about non-differentiable functions?
R: No problem, as long as the function is differentiable at the place you try to compute the gradient.
/endquote
So why am I so excited about this? I suffered through a week of frustrations and headaches trying to implement a log-normal survival model by hand. You can see my frustration here [3]. Also, my second-derivative calculations were abysmal, which meant we would be computing unreliable confidence intervals. The whole thing made me depressed. I was pointed to the Python library autograd [4], and after some wrestling, it was like a beam of heaven shown down on me. Computing gradients with autograd is easy, computing second-derivatives is easy, and performance is near identical. I imagined future generalizations and abstractions, and this has radically simplified my code base. One cool idea is that since we know the second derivative exactly, we can compute the variance matrix of the fitted parameters exactly, and we can use the delta method (and autograd) to compute variances of arbitrary functions of those fitted parameters. Here are two worked examples of the cool things you can do in lifelines now:
Autograd also enabled lifelines to implement accelerated failure time models, so users now have three new regression models to play with. I’ve been so happy with autograd that I’ve converted parts of my other Python library, lifetimes, to use it as well.
I’ve given the lifelines docs a serious facelift, probably doubled the amount of content, and edited large parts of it. Overall, I am much happier with the docs now. One addition I made was adding tracking of visitors’ searches on the docs site. This gives me some idea of where users might be confused, or where current docs structure is insufficient.
It's been a productive few months, and I think lifelines is in a good state. More of my attention now is on lifetimes (a new version was just released, by the way) and some other topics. Hope you enjoy lifelines!
[3] https://github.com/CamDavidsonPilon/lifelines/issues/622
]]>CoxTimeVaryingFitter
, is used for time-varying datasets. Time-varying datasets require a more complicated algorithm, on that works by iterating over all unique times and "pulling out" the relevant rows associated with that time. It's a slow algorithm, as it requires lots of Python/Numpy indexing, which gets worse as the dataset size grows. Call this algorithm batch. The second implemented class, CoxPHFitter
, is used for static datasets. CoxPHFitter
uses a simpler algorithm that iterates over every row once only, and requires minimal indexing. Call this algorithm single¹.
The strange behaviour I noticed was that, for my benchmark static dataset, the more-complicated CoxTimeVaryingFitter
was actually faster than my simpler CoxPHFitter
model. Like almost twice as fast.
The thought occurred to me that iterating over all unique times has an advantage versus iterating over all rows when the cardinality of times is small relative to the number of rows. That is, when there are lots of ties in the dataset, our batch algorithm should perform faster. In one extreme limit, when there is no variation in times, the batch algorithm will perform a single loop and finish. However, in the other extreme limit where all times are unique, our batch algorithm does expensive indexing too often.
This means that the algorithms will perform differently on different datasets (however the results should still be identical). And the magnitude of the performances is like a factor of 2, if not more. But given a dataset at runtime, how can I know which algorithm to choose? It would be unwise to let the untrained user decide - they should be abstracted away from this decision.
As a first step, I wanted to know how often the batch algorithm outperformed the single algorithm. To do this, I needed to create datasets with different sizes and varying fractions of tied times. For the latter characteristic, I defined the (very naive) statistic as:
$$\text{n unique times} = \text{frac}\; \cdot \; \text{n rows} $$
Once dataset creation was done, I generated many combinations and timed the performance of both the batch algorithm (now natively ported to CoxPHFitter
) and the single algorithm. A sample of the output is below (the time units are seconds).
N | frac | batch | single |
432 | 0.010 | 0.175 | 0.249 |
432 | 0.099 | 0.139 | 0.189 |
432 | 0.189 | 0.187 | 0.198 |
... | ... | ... | ... |
So, for 432 rows, and a very high number of ties (i.e. low count of unique times) we see that the batch algorithm performance of 0.18 seconds vs 0.25 seconds for the single algorithm. Since I'm only comparing two algorithms and I'm interested in the faster one, the ratio of batch performance to single performance is just as meaningful. Here's all the raw data.
Looking at the raw data, it's not clear what the relationship is between N, frac, and ratio. Let's plot the data instead:
What we can see is that the ratio variable increases almost linearly with N and frac. Almost linearly. At this point, one idea is to compute both the statistics (N, frac) for a given dataset (at runtime, it's cheap), and predict what its ratio is going to be. If that value is above 1, use single algorithm, else use batch algorithm.
We can run a simple linear regression against the dataset of runtimes, and we get the following:
import statsmodels.api as sm
X = results[["N", "frac"]]
X = sm.add_constant(X)
Y = results["ratio"]
model = sm.OLS(Y, X).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: ratio R-squared: 0.933
Model: OLS Adj. R-squared: 0.931
Method: Least Squares F-statistic: 725.8
Date: Thu, 03 Jan 2019 Prob (F-statistic): 3.34e-62
Time: 13:11:37 Log-Likelihood: 68.819
No. Observations: 108 AIC: -131.6
Df Residuals: 105 BIC: -123.6
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.3396 0.030 11.367 0.000 0.280 0.399
N 4.135e-05 4.63e-06 8.922 0.000 3.22e-05 5.05e-05
frac 1.5038 0.041 37.039 0.000 1.423 1.584
==============================================================================
Omnibus: 3.371 Durbin-Watson: 1.064
Prob(Omnibus): 0.185 Jarque-Bera (JB): 3.284
Skew: -0.166 Prob(JB): 0.194
Kurtosis: 3.787 Cond. No. 1.77e+04
==============================================================================
Plotting this plane-of-best-fit:
We can see that the fit is pretty good, but there is some non-linearities in the second figure we aren't capturing. We should expect non-linearities too: in the batch algorithm, the average batch size is (N * frac) data points, so this interaction should be a factor in the batch algorithm's performance. Let's include that interaction in our regression:
import statsmodels.api as sm
results["N * frac"] = results["N"] * results["frac"]
X = results[["N", "frac", "N * frac"]]
X = sm.add_constant(X)
Y = results["ratio"]
model = sm.OLS(Y, X).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: ratio R-squared: 0.965
Model: OLS Adj. R-squared: 0.964
Method: Least Squares F-statistic: 944.4
Date: Thu, 03 Jan 2019 Prob (F-statistic): 2.89e-75
Time: 13:16:48 Log-Likelihood: 103.62
No. Observations: 108 AIC: -199.2
Df Residuals: 104 BIC: -188.5
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.5465 0.030 17.941 0.000 0.486 0.607
N -1.187e-05 6.44e-06 -1.843 0.068 -2.46e-05 9.05e-07
frac 1.0899 0.052 21.003 0.000 0.987 1.193
N * frac 0.0001 1.1e-05 9.702 0.000 8.47e-05 0.000
==============================================================================
Omnibus: 10.775 Durbin-Watson: 1.541
Prob(Omnibus): 0.005 Jarque-Bera (JB): 21.809
Skew: -0.305 Prob(JB): 1.84e-05
Kurtosis: 5.115 Cond. No. 3.43e+04
==============================================================================
Looks like we capture more of the variance (\(R^2\)), and the optical-fit looks better too! So where are we?
This idea was recently implemented in lifelines, and with some other optimizations to the batch algorithm, we see a 60% speedup on some datasets!
¹ ugh, naming is hard.
]]>I've been playing the video game Heroes of the Storm by Blizzard for about 3 years now. I even attended the game's professional scene's championship last year! One part of the game that has attracted me is the constantly accumulating character roster (I'll provide a short summary of the game later in this post). This means that new characters are being added, and existing characters are being tweaked if they are too powerful or not powerful enough. It's this latter feature that I am most interested in, and the purpose of this blog post: how do we know when a character is over or under powered? There are two methods: a wrong method, and a less-wrong method. This blog article's goal is to re-describe the the old, wrong method, and propose a new ideas for character balance. Keep in mind: the estimated values count for less than the description of the idea.
But before we get into the methods, a quick summary of Heroes of the Storm gameplay (you can skip if you are already familiar).
Heroes of the Storm is a 5-player vs 5-player team game. Before the game begins, each player choose a unique character (called heroes in the game) and is combined with four other players to battle against the opposing team. The objective is to destroy the opponents "core", using either the characters' unique abilities or powerful bonuses scattered around the play area (called a "map"). The game ends when either team's core is destroyed.
At the time of writing, there are over 70 characters, each with unique abilities and advantages. A players decision to choose a characters is based on i) their affinity for a character, ii) which characters are strategically advantageous, given what your opponent is playing (called the "meta"), or iii) which characters are over powered. Focusing on this last reason, a character can be over powered / under powered if their abilities are providing too much damage, healing, etc. Blizzard, the game's creator, will address these different power level of characters with weekly balance patches - however with the game constantly changing, there will never really be perfect equality.
Players can upload their games to third-party services, which aggregate all the games players, characters, map and outcomes into valuable statistics. A very influential statistic is the win rate of character i, defined as:
$$\frac{\text{wins}_{i}}{\text{games participated}_i} $$
which is an estimator for:
$$P(\text{win} \;|\; \text{$i$ present in game})$$
Ideally, the win rate of all characters are close to 0.5, as 0.5 is suppose to represent a character that is equally like to win as to lose (and hence "balanced"). Estimates typically range from 0.35 for the most under-powered, to 0.65 for the most over-powered. In fact, the ordering of this estimate over all characters is commonly used by players to find the most over-powered and under-powered characters, and call for "nerfs" or "buffs" respectively to rebalance the character.
But this is wrong. The win rate is only a correlation. It's not taking into account the entire causal picture. Let's explore that next.
The problem with the above estimate, which I call the naive win rate, is that character selection and the chance of winning have common causes. These common causes are called confounders. For example, professional players may a) select certain characters more often than others and b) win more often. In an extreme case, if the best player in the world selected character T almost exclusively, then character T would have a higher than average win rate, but not necessarily due to the character.
Another example: depending on the other characters on an allied team, a) selection of a character changes, and b) probability of winning changes. If character T is often paired together with S because of some pair synergy, then the win rate of T will be influenced by the win rate of S.
Another example: the higher a team's average MMR (a measure of experience) may a) select a "meta" character more often and b) win more often. More specifically, the more experienced a team, the more likely they will recognise the best character to play in response to their opponents character, and because of such experience, are more likely to win the game.
Another example: when a new hero is released, or an existing hero gets major changes, less experienced players may choose them more. This has an effect on character selection, and on the win rate. Another confounder.
The point is: there exist many causes of both character selection and win probabilities. This means that the naive win rate is not necessarily a good measure of balance, but simply a correlation (not causation) between character selection and winning.
In fact, if you are familiar with causal diagrams, this is perhaps what the causal diagram looks like:
The arrow we are interested in is the arrow between character and win, which I call the causal win rate. The strength of this arrow (difference from 0), and its direction (positive or negative) tell us if a character is over or under powered. This tells us "how much more likely is a win if you choose character T instead of character S?"
We have lots of statistical methods to handle the confounders that muddy our naive win rate. The end result is an estimate of the causal win rate. However, causal inference requires you to state assumptions. Here's the big causal assumption, called exchangeability:
Controlling for the map, opponents' average experience, opponents' characters, allied characters, and allied experience, the win rate of those players that did and didn't choose character T would be the same on average if they all chose character T.
We are also going to simplify the problem, which I'm sure is wrong but it's a first approximation. Let's assume that the character in question is chosen last. That is, the decision to choose character T happens with perfect information of all other character selections and map knowledge. We are also assuming no effect modification (which certainly exists). Effect modification is when the effect (causal win rate) can be modified by other factors. For example, we already mentioned above that possible synergies can exists between pairs of characters - that is a form of effect modification. Similarly, the team composition (ex: how many healers the team has, how many damage dealers, etc.) is a form of effect modification. Like I said, we are going to ignore effect modification for now, and will return to it in another blog post.
For our modelling, we will use outcome regression, which is a causal inference method that carries with it lots of modelling assumptions (Note: there are better methods which have less model assumptions). For example: we are assuming a linear relationship between outcome and variables and we are assuming no interactions (effect modifiers).
Our regression will look like:
$$ \begin{align} \text{logit}(\text{win}_i) & = \alpha_1 \text{AlliedVallaPresent}_i \\ & + \alpha_2 \text{AlliedLiLiPresent}_i + ... \\ & + \beta_1\text{AvgAlliedMMR}_i \\ & + \beta_2\text{AvgOppMMR}_i \\ & + \beta_3\text{Map}_i + ... \\ \end{align} $$
where each a coefficient represents a scalar or vector (as in the case of \(\text{Map}_i\)).
Note that (and this is a consequence of our simple model) that all characters in the regression are symmetric. Thus we just need to run a single regression, and the parameters \(\alpha_i\) are all we are interested in.
What does \(\alpha_i\) represent though? Nothing without a reference point. I chose to use the character Kerrigan as the reference character. A positive alpha represents a better causal win rate relative to Kerrigan, a null alpha represents an equal causal win rate relative to Kerrigan, and a negative alpha represents a worse causal win rate relative to Kerrigan. These relative comparisons may seem like a disadvantage, but I chose Kerrigan specifically as she was the median character with respect to causal win rate (so I ran the program once, using some other character, and Kerrigan was the median in that analysis). Truthfully, it doesn't really matter who is the reference character, the ordering between characters is agnostic to the reference character. However, I choose the median character because they create a natural division of who is over-powered and who is under-powered.
The dataset we are going to use is from user-submitted games to hotslogs.com. The time period is 4/26/2017 to 5/9/2017 which is a period inbetween two balance patches. I filtered down to Hero League and Team League games, and sampled 50k games for analysis.
After we fit the regression, we have an estimated value of alpha for each character. Importantly, we also have error estimates for each alpha that we use to form confidence intervals for each alpha. If this confidence interval contains 0, we can conclude that the character is either balanced, or unbalanced too little to detect. Here are the results (clickthrough for a larger image):
The characters Li Li, Lucio, and Probius are the strongest. On the other hand, Medivh, Tassadar and E.T.C. are the weakest. For some evidence that our analysis is not absolutely wrong: 5 of the top 6 characters above were nerfed in the May 10th, 2017 balance patch.
How does our causal win rate estimates compare to the naive win rate?(Clickthrough for a larger image)
The two metrics are positively correlated, which offers evidence that the naive win rate isn't biased too much by the confounders. The characters in the upper-left and lower-right quadrants are characters that the community is probably thinking about wrong. For example, Lt. Morales has a low naive win rate, and the community might suggest she get a buff. However, she likely is over performing in this dataset. Another very interesting thing to point out is that all the support characters are "shifted" about 0.2 units to the right. Historically this period was when the double-support meta (two or more support characters per team) was gaining popularity, perhaps because Blizzard kept supports artificially over powered as the above graph suggests. Note that this support character advantage is hidden if we only looked at naive win rates. This support advantage is not a artefact of the analysis either, as when I did the same analysis for a more recent dataset, the support shift wasn't present any more.
I managed to get a more recent data (as of the time of writing) from July 2018, spanning a balance patch. So we can observe what happens to heroes' win rates as they were buffed or nerfed on the July 25th balance patch. First, some of the nerfed characters:
And buffs: Azmodan and Kael'Thas saw straight buffs, however Hanzo saw some buffs and some nerfs, and his causal win rate stayed about the same.
There are a lot of critiques one could do about this analysis:
What I hope from this article is that Heroes of the Storm players, and players of similar games, rethink how the naive win rate can be biased. However, one thing that I'd like to think about more is how biased are we? We saw that there was a high correlation between the naive and the causal win rate - does this imply we can use the naive win rate in most circumstances? I'm not sure, and it's something I want to explore. An approach is to use a causal inference model with less assumptions, and including effect modification, to see if we still see a high correlation between the two rates.
Sketch code is available on my Github here.
]]>
The authors start by saying that researchers will naively stratify a risk factor, like smoking, by the birthweight (BW) of infants. Researchers noticed that if the risk factor produced low birthweight (LBW) infant more often, then there would be a specific birthweight where infants below this threshold would have a lower mortality if in the riskier group. For example, for infants born below ~2000g, those infants who were exposed to maternal smoking actually had a lower infant mortality rate than those not exposed to maternal smoking. See figure below:
This seems like madness - not smoking is associated with higher infant mortality in some populations! Some researchers have even suggested that:
"the effect of maternal smoking is modified by birth weight in such a way that smoking is beneficial for LBW babies."
Let's investigate what is really going on.
The authors use a large dataset of over 3 million infants. They define LBW as less than 2500g. Their metric of interest, infant mortality, is defined as a death in the first year per 100,000 livebirths. How do they summarise the effect of maternal smoking?
"We used logistic regression to estimate infant mortality rate ratios and 95 percent confidence intervals..."
In other words, they will be using a specific coefficient in a logistic regression model to produce a summary of the effect of maternal smoking on mortality. Specifically, if we let \(M\) denote the presence of maternal smoking (0 or 1) our model might look like:
$$\text{logit}(p_i) = \beta_0 + \beta_1 M_i $$
and so \(\beta_1\) summarizes the effect of maternal smoking. We'll call the above model the "crude" model. Let \(B\) denote birthweight. We call the following model the "adjusted" model:
$$\text{logit}(p_i) = \beta_0 + \beta_1 M_i + \beta_2 B_i$$
The next question to ask is what else to control for in this regression. The authors say:
"control for potential confounders (maternal age, gravidity, education, marital status, race/ethnicity, and prenatal care). Since adjustment for these potential confounders had virtually no effect, we present only the unadjusted rate ratios below."
What does this mean? If we extended the model to:
$$\text{logit}(p_i) = \beta_0 + \beta_1 M_i + \beta_2 B_i + \beta \cdot C_i$$
where \(C\) are other controls mentioned above, this doesn't change the conclusion (or the coefficients - it's not clear), so the author choose to present results without the additional controls.
After running the regression, the crude model produced that \(\beta_1 = 0.438 \pm 0.029 \). (The paper actually uses the rate ratio, which is the exponential of the coefficient in the model - we won't). However, if we use the adjusted model, then \(\beta_1 = 0.049 \pm 0.032 \), a 10-fold decrease. What happened? Why, after conditioning on BW, did we seem to lose so much effect? It gets worse, in fact.
If we stratify the population into LBW and non-LBW (a "stricter" way to control for a continuous variable), and train the crude model on both these sub-group, we see that smoking actually benefits the infants in the LBW sub-group: the rate ratio of maternal smoking less than 1 (the opposite conclusion is found for the non-LBW sub-group).
I'm going to assume the reader is familiar with causal diagrams. I really liked that the authors went through increasingly complex causal diagrams to explain the paradox. I'll do the same. (Diagrams produced by my latest side project, dog).
This is the simplest model of our causal relationships. Smoking causes LBW, but has no direct effect on mortality, except through LBW. A very useful thing I learned from this paper is that a causal diagram has implications that the data should support, and if the data doesn't support it, the causal diagram is invalidated! Let's see how this works using the results from the regressions. If the above causal diagram is true, then smoking and mortality are certainly correlated. And we see this in the crude regression above (the coefficient is not 0). On the other hand, this diagram also implies that conditioning on BW means that smoking and mortality are independent. However, in the adjusted model above, where we do condition for BW, we still see that the coefficient of smoking is still far from 0. Thus, the data suggests that smoking and mortality are dependent, conditioned on BW. Hence, this diagram is wrong.
We had success with our implication test above. Let's try the same test on this new causal diagram. In the above causal diagram, if we condition on smoking, LBW and mortality should be independent. However, in the adjusted model, LBW has a non-zero coefficient. So we can refute the above diagram.
In this new model, we have smoking causes LBW, smoking causes mortality and there is a causal relationship between LBW and mortality. We see that LBW is a mediator for smoking. The authors are more precise and say that smoking is partially mediated through LBW, since there still is a direct effect of smoking on mortality. The authors state that this DAG is actually the simplest model, because it is "complete" (I suppose they mean this in the graph-theory sense of "complete"). In other words, the previous two graphs had very strict constraints on relationships that allowed us to "test" the graphs. The above graph doesn't have this property, and technically has the least amount structure. However, it probably presents the most accurate model of reality thus far.
This is where the story gets interesting. If you showed this model to domain experts, they would say it is too simple. A domain expert would suggest that there are other, possibly unobserved, factors involved in BW, like birth defect. If this is true, bad things can happen. Let's explore this in diagram 4. (From here, I'll actually diverge a bit from the paper, but only because the paper does it's presentation of diagrams in an odd manner.)
This new diagram introduces an unobserved variable (in blue), birth defects, and removes some causal links - but as we'll see even this simple diagram is enough to give us our "paradox". Birth defects also causes LBW and also has a (strong) positive effect on mortality. So, given this diagram, let's see what happens when we condition on LBW. Among LBW infants, if the mother didn't smoke, then we know the only other cause for LBW was birth defects. By conditioning on LBW, we have introduced a spurious correlation between smoking and birth defects ("if not this, than that") - in fact, this association is negative. And because birth defects has a positive association with mortality (and smoking has none), we have introduced a negative association between smoking and mortality.
The problem occurs because birth defects is unobserved. If we had observed birth defects, we could have controlled for it and everything would have been fine.
The story above doesn't mention observations - so why is observation so important? Even though we don't observe birth defects, the "effect" of it is still present in the outcomes. And because BW is caused by it, the effect of birth defects is passed through to BW in any analysis. This means that even though BW has no effect on the outcome, it will still look like it because it's highly correlated with birth defects.
Let's introduce a link between smoking and mortality now. The authors state
In this case, the rate ratio adjusted for birth weight would be a combination of 1) the true direct effect of smoking on mortality and 2) the bias resulting from the presence of common causes of LBW and mortality.
It's not obvious which effect is larger in magnitude, apriori. Similar to above, if we could control for birth defects, this bias would vanish.
With this diagram, we've added a link between LBW and mortality, adding more plausibility to our diagram. The same bias still remains, however.
The authors then put forward another diagram that suggests that only some types of low birth weight (LBW_B) are actually harmful. The argument is as follows.
Some instances of LBW are not harmful, like being born in a higher altitude. That is, LBW caused by higher altitude infants do not see a higher risk than their counterfactual counterparts born at normal BW in a low altitude area. This LBW is of the type LBW_A. However, a LBW in a low altitude area is of much higher risk, since the cause of the their LBW isn't LBW_A but LBW_B, and is of much higher risk. In this case, if we condition on LBW, we would introduce the same biases observed above.
This is probably my favourite part of the paper. The authors ask "why are we conditioning on BW anyways?" Researchers can be interested in two questions: what is the total effect of smoking, and what is the direct effect of smoking (aside: I learned a lot about the differences between total and direct effects by reading about the Table 2 fallacy.) If the researchers' goal is compute the total effect of smoking on mortality, then adjustment for a child variable (like BW) is not correct. (The author's state that, for many sensible reasons, BW is a cause of smoking, and not vice versa). In fact, the only adjustment needed is for confounders of smoking and mortality. This, according to the back-door criteria, is sufficient to compute the total adjustment.
However, if the researchers goal is to compute the direct effect, things are more difficult. To compute a direct effect, one needs to control for any mediators of smoking. This makes controlling for BW sensible. But then, we need to make sure that if those mediators are colliders (have more than a single cause), we need to control for all their parents too to satisfy the backdoor criteria. This is what happened in Diagram 4 above: because we conditioned on BW, but failed to condition on unobserved birth defects, our estimate of the direct effect of smoking was biased.
In reality, I imagine the original researchers did not have a causal diagram in mind when they first encountered the paradox. I doubt they even had the vocabulary of direct or total effects.
In conclusion, I hope this articled helped understand this highly important and influential paper. Apriori, we know that smoking is bad (hence the confusing "paradox"), but how often might this happen in less conspicuous contexts? Without explicitly drawing or stating a causal diagram, it's easy to make the researchers' original mistake and think some interesting phenomenon is occurring.
]]>
The idea of grouping data science tasks has been bouncing around in my head for the past six or so months, and recently I came across a paper by Miguel Hernán¹ that summarised the ideas nicely. In fact, the metaphorical "pillars" idea came from his paper. This blog article is a combination of his ideas and my own. With that, I present what I (and he) consider the three pillars of data science.
Description is what I feel the majority of data scientists in industry focus on. This could be something as simple as mean, a conversion rate, a time series, etc. or something more advanced like a statistical test. Description also includes data mining algorithms like clustering and association rules.
This is a well explored area, thanks to the 20th century focus on descriptive statistics and methods. This was the result of a few simultaneous conditions:
While this is what I feel is the most common pillar in industry, it's not talked much about in the community. I wonder if is this because it's due to the classical feel of these techniques? Or perhaps it's what I would prefer but do not believe, is that they are so well studied and understand that we need not talk about them. (Why do I not believe this? Take a look at Andrew Gelman's terrific blog almost any time he posts an article. It's always full of mistakes other researchers make, let alone people in industry). No, most of the community's attention is given to the next pillar: prediction.
We spend most of our time thinking, implementing, watching and talking about prediction. Many times a day, a new ML prediction algorithm or article will surface on Hacker News. Contrast where we are today with the previous century: we have lots of compute power, available datasets and accessible technology for prediction. Prediction, in some cases, has been widely profitable for companies. (Not to say description hasn't either. Pharma companies rely on descriptive methods to validate their work. Banks and insurance companies, too, rely on descriptive methods to control their business investments - no other technique could help.) I say some cases, because I do see a lot of excess application of prediction, often where description (or the third pillar) would be sufficient.
The next pillar is the least talked about. Surprisingly too, considering the importance of it.
Sometimes I feel ashamed to admit how ignorant I really am. To be transparent, I didn't learn much causal inference (or care) until about nine months ago. And even then, it was just fumbling around in a dark room of ideas. Slowly, however, I picked up the pieces and learned what to learn. So why should a data scientist know about causal inference?
From my experience, a data scientist gets asked questions that can only be answered with a causal inference lens. Questions like, to use my company Shopify as an example:
These questions, because of the complex nature of the system, cannot be answered with traditional summary statistics. Confounders abound, and populations are biased. We require new tools and ideas that come from causal inference.
Randomised control trials (RCTs) are very common in tech companies. This classical technique is actually generalised in causal inference frameworks. Beyond this, as a data scientist working in industry, you'll have a new perspective on what your observational data can (and cannot!) deliver.
What I find worrying is that, as a community, we are not talking about causal inference enough. This could have some serious harm. If practitioners in industry aren't aware of techniques from causal inference, they will use the wrong tool and could come to the wrong conclusion. For example, both Simpson's Paradox and Lord's Paradox, two very common phenomena, are easy explainable in a causal framework, but without this framework, people can make disastrous and harmful conclusions. For example, in the low birthweight paradox (see paper below), without causal inference there is the potential to conclude that smoking helps low birthweight infants! Furthermore, I rarely see school programs or bootcamps in data science teaching causal inference. It's almost exclusively prediction and technology!
So, rather than describe the slow and meandering way I discovered causal inference, I'll post some links below to give you a head start (probably read these in order):
There's so much more, but at least this gives you a place to learn what to learn.
This article will likely evolve over time, as I discover more and more about the boundaries of these three pillars. But I hope the reader recognises a pillar they are lacking in (for me it was causal inference). Based on this construction, I do expect there to be huge potential for practitioners of causal inference (ex: econometricians and epidemiologists come to mind) in teaching data science.
]]>Take a look at this string of numbers:
333 2 333 2 333 2 33 2 333 2 333 2 333 2 33 2 333 2 333 2 …
At first it looks like someone fell asleep on a keyboard. But there’s an inner logic to the sequence:
This sequence creates itself, number by number, as explained in the image above. Each digit refers to the number of consecutive 3s before a certain 2 appears. Specifically, the first digit refers to the number of consecutive 3s that appear before the first 2, the second digit refers to the number of 3s that appear consecutively before the second 2, and so on toward infinity.
The sequence never ends, but that won’t stop us from asking us questions about it. What is the ratio of 3s to 2s in the entire sequence?
Let's first implement the sequence in Python. We can use a queue to hold the sequence, and when we pop a new element off the queue, we can extend the queue (adding three 3s and a 2, or two 3s and a 2) based on the value of the popped element.
We could brute force the solution by computing the ratio as we extend the sequence ever longer, but this will only ever give us an approximation, and no realisation as to why the solution is so.
I'm going to show off how I played around with the problem, naively, and then started to formulate a solution. The first thing I computed was the ratio of 2s present in the sequence. This isn't the desired answer (ratio of 3s over 2s), but the ratio of 2s, which i'll call \(R\), can be used to find the original solution too: \(\frac{1- R}{R}\). Also, \(R\) can be seen as a probability, something that didn't turn out to be useful but was a possible attack. Using the Python code, R was computed to be approximately 0.2679491. So I had some place to start.
Next I looked at where the 2s were placed in the sequence. Ex: position 4, 8, 12, 15, and so on were 2s. Extending, 2s were placed at positions:
4, 8, 12, 15, 19, 23, 27, 30, 34, 38, 42, 45, 49, 53, 56, 60, 64, 68, ...
Inspecting this, it's clear that 2s will occur every fourth number, except every 4th time, when it appears after three numbers. Call this last exception a "skip". Playing around with this pattern, I constructed a crude function to estimate the position of the \(n\)th 2, \(4n - 4n / (16)\). (I didn't simplify this to make it more clear: the first term is where I expect the position to be given there were no "skips"; the second term is the estimated number of skips after \(4n\) elements of the sequence)
It gave pretty close results for values of \(n\) under 50. I could also estimate \(R\) by looking at \( \frac{n}{4n - 4 n / 16} \). For large \(n\), this gave a number pretty close to my Python approximation of \(R\) (above). But it wasn't close enough, and something was wrong. Simplifying the expression to \(\frac{1}{4 - \frac{1}{4}}\) - I guessed that maybe this looks like a continued fraction! Trying a more accurate approximation \(\frac{1}{4 - \frac{1}{1 - \frac{1}{4}}}\) gave an even better result! Newly motivated, I went back to my "crude approximation function". In the limit, the correct function would look like \( 4n - Rn\). Why? Again, given no "skips" we expect the \(n\)th 2 to be at \(4n\), then subtract the number of "skips", which is exactly the number of 2s, given by \(Rn\). Thus \(R = \frac{1}{ 4 - R} \). (This confirms the continued fraction). Solving the quadratic, \(R = 2 - \sqrt{3} \approx 0.26794919 \).
Now we can solve the original problem, given by \(\frac{1- R}{R} = \frac{\sqrt{3} - 1}{2 - \sqrt{3}} \approx 2.7320508\).
[Image credit to FiveThirtyEight]
]]>Last week, I achieved a goal of hitting 200 problems solved in Project Euler. If you are unfamiliar with Project Euler, it's a website where users can solve tricky problems that require both mathematics and programming to solve. An example question:
(BBBW) | (B,BBW) | (B,B,BW) | (B,B,B,W) |
(B,BB,W) | (BBB,W) | (BB,BW) |
I started in Jan. 2016, and was able to collect my progress of completion over time:
There is also the concept of the the "difficulty" of a problem, provided by the website itself as a measure of how often participants get the answer wrong. Here is the cumulative difficulty:
I'm not finished with Project Euler yet - there are way to many problems I spent time on but didn't solve and they are just itching me. However, I am taking a break for a while so I can focus on some other projects. After all, solving 200 problems was technically my lifetime goal.
If a person asked me "is Project Euler worth the time?" - I wouldn't immediately say yes. You do need lots of time, and it really is a personal thing (there is a leaderboard, but it's not nearly as public as, say, StackOverflow's leaderboards). You also need to think about the opportunity cost - will solving these problems further your career or research? Personally, I feel that Project Euler rounded out my understanding of computer science and mathematics, and not so heavily loaded on statistics and Python applied to statistics.
My problem was not as easy as this:
Essentially, all the information I had was the following: my job threw a NullPointerException somewhere when it tried to process this new dataset:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 533 in stage 1.0 failed 4 times, most recent failure: Lost task 533.3 in stage 1.0. java.lang.NullPointerException
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1454)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1442)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1441)
...
But I still needed to finding this offending row. The idea of a binary search over the dataset appealed to me, but the data was distributed: there was no sense of an “order” to it, that is, I had no concept of a middle value to split the data on. The next idea I thought of was using modulus. Suppose each row had an integer 1,2,3,… associated with it, then I could split the data modulo 2 and test either side, 0 or 1, for a failure. Suppose that the 1s-half contained the failure. Then I should look at splitting the dataset by 1 and 3 modulo 4. And so on. This idea could be applied recursively. In practice, it relies on the fact that, given \(x \equiv k \bmod 2^n\) implies either \(x \equiv k \bmod 2^{n+1}\) or \(x \equiv k + 2^n \bmod 2^{n+1}\). Using this, I can recursively halve my dataset until I have found a single row (or a much smaller set atleast). I’ll explain how to associate each row to an integer in a moment, but first, here’s a sample algorithm in Python:
This is a toy example, but the gist of it is that each round I halve the dataset
and test one of the halves for the desired behaviour (implemented in test
). The running time of this algorithm is \(\mathcal{O}(T\log{n})\), where \(n\) is the size of the dataset and \(T\) is the running time of the job.
This assumes there is an easy way to assign an integer to each row. When implementing this idea in Spark to a distributed dataset, there is no way to index each row consecutively (Spark does have zipWithIndex
in its RDD layer, though this does an internal collect
which isn’t very desirable.) However there is a monotonically_increasing_id
function, which gives each row a monotonically increasing integer, but not necessarily consecutive. Fortunately this algorithm still works (and typically this is where a normal binary search would break). The runtime however is no longer \(\mathcal{O}(T\log{n})\), but \(\mathcal{O}(T\log{M})\), where \(M\) is the maximum integer assigned. So in Spark, my implementation to find the offending row looked like:
So I let this run for an hour or so, and it worked! I was very happy to find the offending row. Though next time, I would change this:
cache
right before your job fails, else you’ll be rerunning the same computations over and overI’m not totally sure this is the fastest way to find the offending line (though, given the tools plus information provided, it might be), but it works and is reliable.
]]>$$ \min_{\theta} -\ell(\theta) + \lambda ||\theta||_p^p $$
where \(\ell\) is the log-likelihood and \(||\cdot||\) is the \(p\) norm. This family of problems is typically solved by calculus because both terms are easy to differentiate ( when \(p\) is an integer greater than 1). Actually this is backwards: we want to solve the MLE using calculus (because that's our hammer) and in order to add a penalizer term, we also need it to be differentiable. Hence why we historically used the 2-norm.
If we don't solve the optimization problem with calculus, well then we can be more flexible with our penalizer term. I took this liberty when I developed the optimizations in lifetimes, my Python library for recency/frequency analysis. The MLE in the model is too complicated to differentiate, so I use numerical methods to find the minimum (Nelder-Mead to be exact). Because of this, I am free to add any penalizer term I wish, deviating as I choose from the traditional 2-norm. I made the choice of \(\log\), specifically:
$$ \min_{\alpha, \beta} -\ell(\alpha, \beta) + \lambda \left( \log(\alpha) + \log(\beta) \right) $$
This was my logic when I first developed lifetimes. Things were going well, until I started noticing some datasets that would produce unstable convergence only when the penalizer coefficient \(\lambda\) was positive: it was driving some variables to nearly 0. How could this be? Shouldn't any positive penalizer coefficient help convergence? For this, we'll take two perspectives of this problem.
I probably don't need to say it, but the log of a value less than 1 is negative. More extreme, the log of a very small value is very very negative (because the rate of change of log near zero gets larger as we approach the asymptote). Thus, during optimization, when a parameter starts to get small, the overall penalizer term starts to gain momentum. In fact, the optimizer starts to shrink a particular parameter to near 0 because that really really helps the overall optimization.
This is obviously not what I wanted. Sure, I wanted to keep values from being too large, but I certainly did not want to reduce parameters to near zero! So it made sense that when I had \(\lambda\) equal to 0 I did not observe this behaviour.
On the other extreme, the \(\log\) penalizer is kinda a terrible penalizer against large values too. An order of magnitude increase in a parameter barely makes a difference in the log of it! It's an extremely sub-linear function, so it doesn't really penalize large parameter sizes well.
As noted in a chapter in Bayesian Methods for Hackers, there is a really beautiful and useful relationship between MLE penalizer terms and Bayesian priors. Simply, it comes down to that the prior is equivalent to the negative exponential of the penalizer term. Thus, the 2-norm penalizer term is a Normal prior on the unknowns; the 1-norm penalizer term is a Laplace prior, and so on. What is then our \(\log\) prior? Well, it's a \(\exp(-\log(\theta)) = \frac{1}{\theta}\) prior on \(\theta\). This is strange, no? It's an improper prior, and it's in fact a Jeffery's prior, so I'm basically saying "I have no idea what this scale parameter should be" - not what I want to be doing.
As much as I like the scale-free property of the \(\log\), it's time to say goodbye to it in favor of another. I think for my purposes, I'll try the Laplace prior/1-norm penalizer as it's a nice balance between disallowing extremely large values and not penalizing the largest parameter too strongly.
Update: I actually went with the square penalizer in the end. Why? The 1-norm was still sending too many values to zero, when I really felt strongly that no value should be zero.
]]>Let \(S_n = \sum_{i=1}^n U_i \) be the sum of \(n\) Uniform random variables. Let \(N\) be the index of the first time the sum exceeds 1 (so \(S_{N-1} < 1\) and \(S_{N} \ge 1\)). The distribution of \(U_N\) is \(e - e^{1-u}\).This result is related to the really astounding result: \(N\) (the same \(N\) defined above) has expected value equal to \(e\). Yes, that's right, the average number of uniform random variables needed to exceed 1 is exactely \(e\). I just love this surprising result, but have still not found a good, intuitive reason as to why this is true. Of course it is true mathematically, and the proof is not too hard, but I just can't see why. Anyways, below is the proof of the first result.
I've seen some really interesting probability & numerical solutions using a strategy called Poissonization, but Googling for it revealed very few resources (just some references in some textbooks that I don't have quick access to). Below are my notes and repository for Poissonization. After we introduce the theory, we'll do some examples.
The technique relies on the following theorem:
Theorem: Let \(N \sim \text{Poi}(\lambda)\) and suppose \(N=n, (X_1, X_2, ... X_k) \sim \text{Multi}(n, p_1, p_2, ..., p_k)\). Then, marginally, \(X_1, X_2, ..., X_k\) are are independent Poisson, with \(X_i \sim \text{Poi}(p_i \lambda)\). [1]
The proof of this theorem is as follows. First, by the law of total probability:
$$
\begin{align*}
P(X_1=x_1, ..., X_k = x_k\;|\; n, \mathbf{p})
& = \sum_{n=0}^{\infty} P(X_1=x_1, ..., X_k = x_k \;|\; N=n) \frac{e^{-\lambda}\lambda^n}{n!} \\
& = \sum_{n=0}^{\infty} \frac{(x_1 + ... + x_k)!}{x_1!x_2! \cdots x_k!}p_1^{x_1}\cdots p_k^{x_k}\frac{e^{-\lambda}\lambda^n}{n!}I_{x_1 + ... + x_k = n} \\
& = e^{-\lambda}\lambda^{x_1} \cdots \lambda^{x_k}p_1^{x_1} \cdots p_k^{x_k}\frac{1}{x_1!x_2! \cdots x_k!}\\
& = e^{-\lambda}(\lambda p_1)^{x_1} \cdots (\lambda p_k)^{x_k}\frac{1}{x_1!x_2! \cdots x_k!}\\
& = \prod_{i=1}^k \frac{e^{-\lambda p_i}(\lambda p_i)^{x_i}}{x_i!}\;\;\;\;\blacksquare
\end{align*}
$$
Let's see how we can use this to solve probability problems. Let \((Y_1, Y_2, ...Y_k)\) be \(\text{Multi}(n, p_1, p_2, ..., p_k)\), and we are interested in when the realized values fall into some set \(A\). Let \(X_i\) be defined as above, that is, \(X_i \sim \text{Poi}(p_i \lambda)\). Then, from above, we can write:
$$
\begin{align*}
P((X_1, ... X_k) \in A) & = \sum_{i=0}^\infty \frac{e^{-\lambda}\lambda^n}{n!} P((Y_1, ... Y_k) \in A)\\
& = e^{-\lambda} \sum_{i=0}^\infty \frac{\lambda^n}{n!} P((Y_1, ... Y_k) \in A)\\
& e^{\lambda}P((X_1, ... X_k) \in A) = \sum_{i=0}^\infty \frac{\lambda^n}{n!} P((Y_1, ... Y_k) \in A) \;\;\;(★)
\end{align*}
$$
Both sides of the last line are infinite series of \(\lambda^n\), and thus their coefficients must be equal. Thus, as \(N=n\), \(P((Y_1, ... Y_k) \in A)\) is equal to \(n! c(n)\) where \(c(n)\) is the coefficient of \(\lambda^n\) in the expansion of \(e^{\lambda}P((X_1, ... X_k) \in A)\).
Let's try an example:
n defects are randomly distributed amongst k integrated-circuit chips produced by a factory (any number of defects may be found on a chip and each defect is independent of the other defects). If n = 4, k=7, find the probability that there is a chip with at least 3 defects.
This is a classic case where it's easier to compute the complementary probability: what is the probability that all chips have less than 3 defects (and then we subtract this from 1 to get the original probability).
Since each defect is randomly distributed, \(p_i = p = \frac{1}{7}\). Below, we invoke Poissonization by letting \(n\) be randomly distributed as \( \text{Poi}(\lambda)\) (later we will condition on its actual value of 4). That means that we can assume the number of defects on a chip is a Poisson random variable with rate \(\lambda p = \frac{\lambda}{7}\).
$$
\begin{align*}
P(\text{All chips have $\le$ 2 defects})
& = \Big(P(\text{Chip has 0 defect}) + P(\text{Chip has 1 defect}) + P(\text{Chip has 2 defect})\Big)^7 \\
& = \Big(e^{-\lambda/7} + \frac{e^{-\lambda/7}(\lambda/7)^1}{1!} + \frac{e^{-\lambda/7}(\lambda/7)^2}{2!}\Big)^7\\
& = e^{-\lambda}\Big(1 + \frac{\lambda}{7} + \frac{\lambda^2}{2!7^2}\Big)^7\\
\end{align*}
$$
When we multiply the above by \(e^{\lambda}\), and imagine the right-hand side is expanded, we get a form that is familiar to equation (★). This means that we are interested in the coefficient of \(\lambda^4\). We could expand this by hand, but instead let's expand this computationally. For this, it's useful to know a bit about multinomial expansions. But basically you are looking for solutions to
$$
k_0 + k_1 + k_2 = 7\\
0 k_0 + 1 k_2 + 2 k_2 = 4
$$
The first equation comes from the constraint for the powers of a multinomial expansion to sum to the outside exponent. The second equation comes from our need to find powers of \(\lambda\) that sum up to 4 (the 0,1,2 come from the existing powers of \(\lambda\)). Two equations, three unknowns, solving for the values:
$$
\begin{align*}
k_2 &= k_0 - 3 \\
k_1 &= 10 - 2k_0\\
3 &\le k_0 \le 5
\end{align*}
$$
Computationally, we can write this in Python:
We get the answer 0.92711, so the complement to this is 0.07289 and is the probability of the original question.
This comes from FiveThiryEight's Riddler series: Santa Needs Some Help With Math. I'll quote the problem here:
In Santa’s workshop, elves make toys during a shift each day. On the overhead radio, Christmas music plays, with a program randomly selecting songs from a large playlist.
During any given shift, the elves hear 100 songs. A cranky elf named Cranky has taken to throwing snowballs at everyone if he hears the same song twice. This has happened during about half of the shifts. One day, a mathematically inclined elf named Mathy tires of Cranky’s sodden outbursts. So Mathy decides to use what he knows to figure out how large Santa’s playlist actually is.
Help Mathy out: How large is Santa’s playlist?
Okay, so let's translate this to our set up. We have \(k\) songs to choose from, and we sample 100 of them. Turns out that P(any song is played more than once) = 0.5. So in this case, we know the probability, but we don't know the \(k\). So, similarly to how we started above:
$$
\begin{align*}
P(\text{songs have 1 or less plays})
& = \Big(P(\text{songs has 0 plays}) + P(\text{songs has 1 plays})\Big)^k \\
& = \Big(e^{-\lambda/k} + \frac{e^{-\lambda/k}(\lambda/k)^1}{1!}\Big)^k\\
& = e^{-\lambda}\Big(1 + \frac{\lambda}{k}\Big)^k\\
\end{align*}
$$
Yay we have a binomial! They are much easier to work with. Now we need to answer:
What is the coefficient of \(λ^{100}\)?
Call this coefficient \(c_{100}(k)\). Since we have a binomial, expanding the above is easy:
$$
c_{100}(k) = \frac{C(k, 100)}{k^{100}}
$$
Using the other piece of information above, that the probability of an "event" is 0.5, we can write an equation down:
$$
\begin{align*}
0.5
& = 100! c_{100}(k) \\
& = 100! \frac{C(k, 100)}{k^{100}} \\
& = \frac{k!}{(k-100)! k^{100}} \\
\end{align*}
$$
$$
2\frac{k!}{(k-100)!} - k^{100} = 0
$$
We can use Python (!!) to numerically solve this equation for \(k\).
Because the sign flips at 7175, we conclude that either 7174 or 7175 is the closest answer.
Looking for more examples?
[1] DasGupta, A. Probability for Statistics and Machine Learning. http://link.springer.com/book/10.1007%2F978-1-4419-9634-3
]]>
Python developers are commonly using Python as a tool to explore datasets - but what if we reverse that analysis lens back on to the developer? In this talk, Cam will use Python as a data analysis tool to explore Python developers and code. With millions of data points, mostly scraped from Github and Stackoverflow, we'll reexamine who the Python developer is, and how Python developers write code.
]]>
You can purchase it on Amazon today!
]]>On the other hand, a non-constructive proof does not detail how to build the object, just states that it must exist. Often this is done by finding a proof by contradiction, but some interesting proofs use probability theory. For example, if I'm interested in showing that there exists an object with some property less than 0, then an argument of the form "the expected value of this property is 0, and if I know there is an object with property more than 0, there must be an object with property less than 0, too". I didn't find that specific object, but I determined it must exist.
There is an interesting trend occurring in data analysis that is analogous to a non-constructive proof. The author won't find the object desired, but will state, statistically, that it does exist. Sound odd?
Suppose it was early 2015, and I was interested in analyzing the Billboard's 2015 Song of the Summer. I might ask, "has the Song of the Summer been released yet?" By analyzing when the winning song was released in previous years, I can get a distribution of when this year's Song of the Summer might be released. If I'm past the majority of previous dates, then I can statistically conclude that 2015's Song of the Summer exists. I haven't shown what the song of the summer is, just that it exists.
In fact, this is what FiveThirtyEight did this year: they suggested that on May 5th, you have probably already heard the 2015 Song of the Summer. They didn't guess which song, just that it existed.
I find this to be a really powerful technique. By showing something exists, it can narrow down your search significantly. Suppose you and a friend were betting on what movie will win the Best Picture Oscar. You could probably do quite well by betting on all movies made after October, and nothing before. This is because the Oscar winner often is released (exists) after October.
Another extension, from a decision maker's perspective, is to understand when something should exist, and place yourself there. For example, good luck to the artist who hopes to achieve Song of the Summer with a late May release! Similar ideas apply to presidential elections (when should a candidate enter the nominee race to maximize chances of being president?) and theatrical releases hoping to win Oscars.
Any other applications? This feels like one of those tricks that is not generally applicable, but very powerful in a some situations. Leave other ideas in the comments!
]]>
This is a very simple Bayesian problem, on paper! See Allen's book for the solution (link above). What if we wanted to use PyMC to solve it? Below is the solution (originally from CrossValidated).
It looks like there is a bug in 2.3.4 (the most recent version) that is causing the wrong inference. This took me a while to discover, but it was solved when I downgraded to PyMC 2.3.2.
import pymc as pm
p = [
#brown, yellow, red, green, orange, tan, blue
[.3, .2, .2, .1, .1, .1, .0 ], # 1994 bag
[.13, .14, .13, .2, .16, .0, .24] # 1996 bag
]
bag = pm.Categorical('which_bag', [0.5, 0.5]) # prior on bag selected
@pm.deterministic
def first_bag_selection(p=p, bag=bag):
return p[bag]
@pm.deterministic
def second_bag_selection(p=p, bag=bag):
return p[1-bag]
# observe yellow in bag 1
obs_1 = pm.Categorical('bag_1', first_bag_selection, value=1, observed=True)
#observe green in bag 2
obs_2 = pm.Categorical('bag_2', second_bag_selection, value=3, observed=True)
mcmc = pm.MCMC([bag, p, first_bag_selection, second_bag_selection, obs_1, obs_2])
mcmc.sample(15000,3000)
bag_trace = mcmc.trace('which_bag')[:]
# what is the probability the bag chosen is from 1996?
print (bag_trace == 1).sum()*1./bag_trace.shape[0]
# should be about .26
]]>
We construct a dynamical population whose individuals are assigned elements from an algebraic group \(G\) and subject them to sexual reproduction. We investigate the relationship between the dynamical system and the underlying group and present three dynamical properties equivalent to the standard group properties.
Consider a sufficiently large population of \(N\) individuals. Sexual reproduction between individuals is a binary action, that is, sex involves two individuals. An interesting assumption is to consider if the population consists of individuals from an algebraic group. Recall the definition of a group : A set, \(G\), and binary operator \(\ast\) is called a group iff
there exists an identity element, \(e\), s.t. \(\forall x \in G, x\ast e=e \ast x = x\).
\(\forall x \in G\) there exists an inverse, \(x^{-1}\), s.t. \(x\ast x^{-1}=x^{-1}\ast x = e\).
\(\forall x,y \in G, x\ast y \in G\) and \(y\ast x \in G\).
What would the behaviour of a sexual system look like if all individuals were members of a group? Can we find such systems in nature, or are our constructions purely artificial?