```
from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter().fit(df['T'], df['E'])
kmf.plot(figsize=(11, 6));
```

To borrow a term from finance, we clearly have different *regimes* that a customer goes through: periods of low churn and periods of high churn, both of which are predictable. This predictability, and the "sharp" changes in hazard, suggest that a piecewise hazard model may work well: the hazard is constant within each interval, but varies across intervals.

Furthermore, we can imagine that individual customer variables influence their likelihood to churn as well. Since we have baseline information, we can fit a regression model. For simplicity, let's assume that a customer's hazard is constant within each period, but varies across customers (heterogeneity in customers). Hat tip to StatWonk for this model:

Our hazard model looks like¹: $$ h(t\;|\;x) = \begin{cases} \lambda_0(x)^{-1}, & t \le \tau_0 \\ \lambda_1(x)^{-1} & \tau_0 < t \le \tau_1 \\ \lambda_2(x)^{-1} & \tau_1 < t \le \tau_2 \\ ... \end{cases} $$

and \(\lambda_i(x) = \exp(\mathbf{\beta}_i x^T), \;\; \mathbf{\beta}_i = (\beta_{i,1}, \beta_{i,2}, ...)\). That is, each period has a hazard rate, \(\lambda_i\), that is the exponential of a linear model. The parameters of each linear model are unique to that period - different periods have different parameters (later we will generalize this).
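To make this concrete, here is a minimal sketch of the hazard function above (the names `piecewise_hazard`, `breakpoints`, and `betas` are my own illustration, not lifelines API; each row of `betas` holds one period's intercept and coefficients):

```python
import numpy as np

def piecewise_hazard(t, x, breakpoints, betas):
    """Hazard h(t | x): constant within each interval, exp-linear in x.

    breakpoints: sorted interval endpoints (tau_0, tau_1, ...)
    betas: array of shape (len(breakpoints) + 1, len(x) + 1); row i holds
           period i's intercept followed by its covariate coefficients.
    """
    # find which interval t falls into
    i = np.searchsorted(breakpoints, t)
    # lambda_i(x) = exp(beta_i . [1, x]); the hazard is its reciprocal,
    # following the lifelines convention noted in the footnote
    lam = np.exp(betas[i] @ np.append(1.0, x))
    return 1.0 / lam
```

With all coefficients at zero, the hazard is 1 everywhere; a positive intercept in period `i` lowers the hazard in that period only.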

Why do I want a model like this? It offers lots of flexibility (at the cost of some efficiency), and importantly it lets me see:

- The influence of variables over time.
- Which variables matter at specific "drops" (or regime changes). For example, which variables cause the large drop at the start? Which variables prevent death at the second billing?
- Predictive power: since we model the hazard more accurately (we hope) than a simpler parametric form, we get better estimates of a subject's survival curve.

One interesting point is that this model is *not* an accelerated failure time model, even though the behaviour in each interval looks like one. This is because the breakpoints (intervals) do not contract or dilate in response to the covariates (though that would be an interesting extension).

¹ I specify the reciprocal because that follows lifelines convention for exponential and Weibull hazards. In practice, it means the interpretation of the sign is possibly different.

```
pew = PiecewiseExponentialRegressionFitter(
    breakpoints=breakpoints,
).fit(df, "T", "E")
```

Above we fit the regression model. We supplied a list of breakpoints that we inferred from the survival function and from our domain knowledge.

Let's first look at the average hazard in each interval, over time. We should see that during periods of high customer churn, we also have a high hazard. We should also see that the hazard is constant in each interval.

```
fig, ax = plt.subplots(1, 1)

kmf.plot(figsize=(11, 6), ax=ax)
ax.legend(loc="upper left")
ax.set_ylabel("Survival")

ax2 = ax.twinx()
pew.predict_cumulative_hazard(
    pew._norm_mean.to_frame(name='average hazard').T,
    times=np.arange(0, 110),
).diff().plot(ax=ax2, c='k', alpha=0.80)
ax2.legend(loc="upper right")
ax2.set_ylabel("Hazard")
```

It's obvious that the highest average churn is in the first few days, and that churn is high again in the later billing periods.

So far, we have only been looking at the aggregated population - that is, we haven't looked at what variables are associated with churning. Let's start by investigating what is causing (or is associated with) the drop at the second billing event (~day 30).

```
fig, ax = plt.subplots(figsize=(10, 4))
pew.plot(parameter=['lambda_2_'], ax=ax);
```

From this forest plot, we can see that `var1` has a *protective* effect; that is, customers with a high `var1` are much less likely to churn in the second billing period. `var2` has little effect, but possibly negative. From a business point of view, maximizing `var1` for customers would be a good move (assuming it's a causal relationship).

We can look at all the coefficients in one large forest plot, see below. We see a distinct alternating pattern in the `_intercepts` variable. This makes sense, as our hazard rate shifts between high and low churn regimes. The influence of `var1` seems to spike in the 3rd interval (`lambda_2_`), and then decays back to zero. The influence of `var2` looks like it becomes more negative over time; that is, it is associated with more churn over time.

```
fig, ax = plt.subplots(figsize=(10, 10))
pew.plot(ax=ax);
```

If we suspect there is some parameter sharing between intervals, or we want to regularize (and hence share information) between intervals, we can include a penalizer which penalizes the variance of the estimates per covariate.

Note: we do *not* penalize the intercept, currently. This is a modeler's decision, but I think it's better not to.

Specifically, our penalized log-likelihood, \(PLL\), looks like:

$$ PLL = LL - \alpha \sum_j \hat{\sigma}_j^2 $$

where \(\hat{\sigma}_j\) is the standard deviation of \(\beta_{i, j}\) over all periods \(i\). This acts as a regularizer, much like a multilevel component in Bayesian statistics. In the above inference, we implicitly set \(\alpha\) equal to 0. Below we examine some cases of varying \(\alpha\). First we set \(\alpha\) to an extremely large value, which should push the variances of the estimates to zero.
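As a concrete sketch, the penalty term could be computed like this (assuming the coefficients are arranged as an array with one row per period; this is my own illustration, not lifelines internals):

```python
import numpy as np

def variance_penalty(betas, alpha):
    """Penalty alpha * sum_j Var(beta_{., j}).

    betas: array of shape (n_periods, n_covariates); column j holds
    covariate j's coefficient across all periods.
    """
    # variance of each covariate's coefficients across periods
    per_covariate_var = betas.var(axis=0)
    return alpha * per_covariate_var.sum()
```

If every period shares the same coefficients, the penalty is zero; the more a coefficient wanders between periods, the larger the penalty grows.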

```
# Extreme case: note that all the covariates' parameters are almost identical.
pew = PiecewiseExponentialRegressionFitter(
    breakpoints=breakpoints,
    penalizer=20.0,
).fit(df, "T", "E")

fig, ax = plt.subplots(figsize=(10, 10))
pew.plot(ax=ax);
```

As we suspected, a very high penalizer will constrain the same parameter between intervals to be equal (and hence 0 variance). This is the same as the model:

$$ h(t\;|\;x) = \begin{cases} \lambda_0(x)^{-1}, & t \le \tau_0 \\ \lambda_1(x)^{-1} & \tau_0 < t \le \tau_1 \\ \lambda_2(x)^{-1} & \tau_1 < t \le \tau_2 \\ ... \end{cases} $$

and \(\lambda_i(x) = \exp(\beta_{0,i} + \mathbf{\beta} x^T), \;\; \mathbf{\beta} = (\beta_{1}, \beta_{2}, ...)\). Note the reuse of the \(\beta\)s between intervals: only the intercepts \(\beta_{0,i}\) vary by period.

This model is the same model proposed in "Piecewise Exponential Models for Survival Data with Covariates".

One nice property of this model is that because of the extreme information sharing between intervals, we have maximum information for inferences, and hence small standard errors per parameter. However, if a parameter's effect is truly time-varying (and not constant), then its standard error will be inflated and a less constrained model is better.

Below we examine an in-between penalty, and compare it to the zero-penalty case.

```
# A less extreme case.
pew = PiecewiseExponentialRegressionFitter(
    breakpoints=breakpoints,
    penalizer=0.25,
).fit(df, "T", "E")

fig, ax = plt.subplots(figsize=(10, 10))
pew.plot(ax=ax, fmt="s", label="small penalty on variance")

# Compare this to the no-penalizer case.
pew_no_penalty = PiecewiseExponentialRegressionFitter(
    breakpoints=breakpoints,
    penalizer=0,
).fit(df, "T", "E")
pew_no_penalty.plot(ax=ax, c="r", fmt="o", label="no penalty on variance")

plt.legend();
```

We can see that:

- on average, the standard errors are smaller in the penalty case
- parameters are pushed closer together (they will converge to their average if we keep increasing the penalty)
- the intercepts are barely affected.

I think, in practice, adding a small penalty is the right thing to do. It's extremely unlikely that intervals are independent, and extremely unlikely that parameters are constant over intervals.

Like all *lifelines* models, we have prediction methods too. This is where we can see customer heterogeneity vividly.

```
# Some prediction methods
pew.predict_survival_function(df.loc[0:3]).plot(figsize=(10, 5));
```

```
pew.predict_cumulative_hazard(df.loc[0:3]).plot(figsize=(10, 5));
```

```
pew.predict_median(df.loc[0:5])
```

In conclusion, this model is pretty flexible, and it is one that can encourage more questions to be asked. Beyond just SaaS churn, one can think of other applications of piecewise regression models: employee churn after their stock-option vesting cliff, mortality during different life stages, or modelling time-varying parameters.

Future extensions include adding support for time-varying covariates. Stay tuned!

`pip install lifelines==0.25.0`

Formulas, which should really be called Wilkinson-style notation but everyone just calls them formulas, are a lightweight grammar for describing additive relationships. If you have used R, you'll likely be familiar with formulas. They are less common in Python, so here's an example: writing `age + salary` is short-form for the expanded additive model: \(\beta_0 + \beta_1\text{age} + \beta_2\text{salary}\). Expanding on that, the grammar allows interaction terms quite easily: `age + salary + age : salary` is

$$ \beta_0 + \beta_1\text{age} + \beta_2\text{salary} + \beta_3 \text{age $\cdot$ salary}$$

Actually, the previous use is so common that a short form of this entire set of interaction and monomial terms exists: `age * salary`. These are just examples, but there is a large set of transformations, ways to handle categorical variables, etc.

This is just the grammar though, and a compiler is needed to parse the string and then translate the parsed string into code. This code can transform an initial dataset into its transformed version that allows the \(\beta\)s to be estimated. For example, transforming the following raw dataset using the formula `age * salary`:

```
df = pd.DataFrame({
    'age': [35, 36, 40, 25, 55],
    'salary': [60, 35, 80, 50, 100],
})
```

becomes:

```
pd.DataFrame({
    'age': [35, 36, 40, 25, 55],
    'salary': [60, 35, 80, 50, 100],
    'age:salary': [2100, 1260, 2400, 1750, 5500],
    'Intercept': [1, 1, 1, 1, 1],
})
```

This new dataframe can be given to any regression library to fit the \(\beta\)s. In Python, libraries like Patsy and the new Formulaic are the parser + code-generator.

Anyways, *lifelines* previously required that all transformations occur in a preprocessing step, and the final dataframe be given to a *lifelines* model. This created some problems, however:

- The user had to learn Patsy in order to use formulas in *lifelines*, which is a barrier to entry.
- In methods like `plot_covariate_groups`, which rely on examining potentially more than one variable, the user had to manually recreate potential interaction terms or more complicated transformations.
- Users often had to `drop` columns from their dataframe prior to fitting with *lifelines*, rather than telling *lifelines* what they wanted to use in the regression. This led to some ugly code.

With *lifelines* v0.25.0, formulas are now native (though optional) to *lifelines* models (old code should still work as well):

```
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi
rossi = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi, "week", "arrest", formula="age + fin + prio + paro * mar")
cph.print_summary(columns=['coef', 'se(coef)', '-log2(p)'])
"""
<lifelines.CoxPHFitter: fitted with 432 total observations, 318 right-censored observations>
duration col = 'week'
event col = 'arrest'
baseline estimation = breslow
number of observations = 432
number of events observed = 114
partial log-likelihood = -659.49
time fit was run = 2020-07-27 15:33:44 UTC
---
coef se(coef) -log2(p)
covariate
age -0.06 0.02 8.21
fin -0.37 0.19 4.19
prio 0.10 0.03 10.96
paro -0.10 0.20 0.73
mar -0.78 0.73 1.81
paro:mar 0.36 0.84 0.57
---
Concordance = 0.63
Partial AIC = 1330.98
log-likelihood ratio test = 31.78 on 6 df
-log2(p) of ll-ratio test = 15.76
"""
```

However, the *real* strength in formulas is their ability to create *basis splines* easily. Basis splines are highly flexible non-linear transformations of a variable - they are essential in the modern statistical inference toolkit:

```
cph.fit(rossi, "week", "arrest", formula="age + fin + bs(prio, df=3)")
cph.print_summary(columns=['coef', 'se(coef)', '-log2(p)'])
"""
<lifelines.CoxPHFitter: fitted with 432 total observations, 318 right-censored observations>
duration col = 'week'
event col = 'arrest'
baseline estimation = breslow
number of observations = 432
number of events observed = 114
partial log-likelihood = -659.88
time fit was run = 2020-07-27 15:36:34 UTC
---
coef se(coef) -log2(p)
covariate
age -0.07 0.02 9.91
fin -0.32 0.19 3.49
bs(prio, df=3)[0] 1.41 0.96 2.82
bs(prio, df=3)[1] -0.18 1.02 0.22
bs(prio, df=3)[2] 2.82 0.81 11.06
---
Concordance = 0.63
Partial AIC = 1329.76
log-likelihood ratio test = 31.00 on 5 df
-log2(p) of ll-ratio test = 16.70
"""
```

Importantly, the new transform logic in *lifelines* is extended to the `predict` and plotting methods too.

For models that have more than one parameter, like `WeibullAFTFitter`, formulas can be crafted for each parameter.

```
from lifelines import WeibullAFTFitter
wf = WeibullAFTFitter()
wf.fit(rossi, "week", "arrest", formula="age + fin + paro * mar", ancillary="age * fin")
wf.print_summary(columns=['coef', 'se(coef)', '-log2(p)'])
"""
<lifelines.WeibullAFTFitter: fitted with 432 total observations, 318 right-censored observations>
duration col = 'week'
event col = 'arrest'
number of observations = 432
number of events observed = 114
log-likelihood = -681.82
time fit was run = 2020-07-27 16:49:31 UTC
---
coef se(coef) -log2(p)
param covariate
lambda_ Intercept 2.19 0.68 9.59
age 0.11 0.03 10.13
fin 0.12 0.19 0.97
paro 0.17 0.14 2.19
mar 0.33 0.53 0.93
paro:mar -0.14 0.61 0.29
rho_ Intercept 1.36 0.41 9.98
age -0.05 0.02 7.25
fin -0.38 0.54 1.04
age:fin 0.02 0.02 1.74
---
Concordance = 0.62
AIC = 1383.65
log-likelihood ratio test = 29.60 on 8 df
-log2(p) of ll-ratio test = 11.97
"""
```

One of my favourite types of research articles is about improving scientific understanding and communication. Last year, a paper by T. Morris *et al.* came out that surveyed statisticians, clinicians, and stakeholders on how to better communicate the workhorse of survival analysis, Kaplan-Meier (KM) curves. A number of potential options were shown to the over 1100 survey participants, who rated each option. The survey results show two changes that could be made to improve understanding of KM curves.

1. Always show confidence intervals around the curves (*lifelines* does this by default)

2. Present all the summary information at the bottom. Many KM curves would present *at-risk* numbers, and *lifelines* had this option as well:

But the participants really liked *all* summary information, including deaths and censorship, not just at-risk. *lifelines* now presents all this information:

It's not displayed by default (that may change), but it is available with the `at_risk_counts` kwarg in the call to `KaplanMeierFitter.plot`.

This small change has the potential to be a massive improvement in understanding these plots. I'm so happy I saw this paper come across my desk.

Performance improvements were actually part of a release a few weeks back and not v0.25.0, but I wanted to highlight them anyways. I found a bug in the switching logic we use for choosing which algorithm to run for the Cox model (see the post on how that's done here). This bug made the choice of algorithm sub-optimal, specifically for very large datasets. After fixing the bug, `CoxPHFitter` is able to process millions of rows in less than a tenth of a second (not to trash on it, but just for comparison: R takes on the order of seconds, and its core algorithm is written in C). This is probably one of the fastest Cox-Efron models, and it's written only in Python. This makes me really proud.

I'm really happy with where this release landed. There were lots of ups and downs in code quality and structure, but I feel like I settled on something nice. Formulas are a big deal, and will take *lifelines* to the next level. Future releases will have support for partial-effect plots, and more.

You can see all the improvements, and important API changes, in v0.25.0 here.

1. It helps dethrone the Proportional Hazard (PH) model as the default survival model. People like the PH model because it doesn't make any distributional assumptions. However, like a Trojan horse, there are very strong *implicit* assumptions that are inherited, which are often too restricting. Suffice to say, I am not a big proponent of the PH model.

2. A spline-based AFT model weakens the CPH's throne because the model can fit to a larger space of potential models. For example, the Weibull AFT model is a special case of this new AFT model. The authors of the paper also carefully demonstrate that it often has lower bias, standard error, or AIC than other popular AFT models (Generalized Gamma, Generalized F).

3. AFT models are just simpler to explain. Coefficients of AFT models have a much nicer interpretation than PH or Proportional Odds (PO) models. Simply: a positive (negative) coefficient multiplicatively accelerates (decelerates) a subject's time-to-event. So, a coefficient of 2 means that a subject experiences the event twice as fast as a baseline subject, on average.

I'm too lazy to give the mathematical details of the model (just drank a strong beer), but what I do want to mention is that the model is implementable in lifelines using our custom model syntax. Here's the code.

Update: it's now part of lifelines as `lifelines.CRCSplineFitter`. Happy coding!

But first, I think we are familiar with an \(L_1\) penalty, but what is an \(L_0\) penalty then? If you work out the math, it is a penalty that *counts the number of non-zero coefficients*, independent of the magnitude of the coefficients:

$$ll^*(\theta, x) = \sum_i^N ll(\theta, x_i) - \lambda \sum_{k=0}^D 1_{\theta_k \ne 0}$$

where \(D\) is the number of potential parameters. Thinking about this for a moment, this means that the \(L_0\) penalty minimizes the AIC, since the AIC is:

$$AIC = -2 ll + 2D^*$$

where \(D^*\) is the number of parameters in the model. It turns out that \(L_0\) penalties encourage *lots* of sparsity in their solutions, much more than \(L_1\).

Given that, the \(L_{1/2}\) penalty is a balance between penalizing the magnitudes of the coefficients and encouraging lots of sparsity. The paper linked above gives reasons why \(L_{1/2}\) is perhaps superior to both \(L_1\) and \(L_0\). Importantly, solving the \(L_0\) problem exactly is NP-hard: it involves a combinatorial explosion of potential solutions, and it can't be solved with gradient methods.
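Written out, the \(L_{1/2}\)-penalized log-likelihood has the same shape as the \(L_0\) version above, with the indicator replaced by the square root of each coefficient's magnitude:

$$ll^*(\theta, x) = \sum_i^N ll(\theta, x_i) - \lambda \sum_{k=0}^D |\theta_k|^{1/2}$$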

The authors provide a very simple algorithm for solving the \(L_{1/2}\) problem, see Section 3 of the paper. It involves repeatedly solving a related \(L_1\) problem with updating coefficient-specific penalizer values. In *lifelines*, we recently introduced the ability to set specific coefficient penalizer values, and we can solve \(L_1\) problems too. Let's see if we can solve \(L_{1/2}\) problems now:

```
def l_one_half_cox(lambda_, df, duration_col, event_col):
    EPSILON = 0.00001
    n_covariates = df.shape[1] - 2  # all columns except duration and event

    # initial L1 fit with uniform penalizer weights
    weights = lambda_ * np.ones(n_covariates)
    cph = CoxPHFitter(l1_ratio=1.0, penalizer=weights)
    cph.fit(df, duration_col, event_col)

    # repeatedly re-solve the L1 problem with updated per-coefficient weights
    for _ in range(20):
        weights = lambda_ / (np.sqrt(cph.params_.abs().values) + EPSILON)
        cph = CoxPHFitter(l1_ratio=1.0, penalizer=weights)
        cph.fit(df, duration_col, event_col)

    return cph.params_
```

In the above code, we repeatedly solve a new \(L_1\) problem with updated penalizer weights. This, according to the authors (and my own assumption that it extends easily to the Cox model), gives us our \(L_{1/2}\) solution. Graphically, we can vary the `lambda_` parameter and see how the coefficient solutions change:

Compare this to our \(L_1\) solution:

$$ \min_{\theta} -\ell(\theta) + \lambda ||\theta||_p^p $$

where \(\ell\) is the log-likelihood and \(||\cdot||\) is the \(p\)-norm. This family of problems is typically solved with calculus, because both terms are easy to differentiate (when \(p\) is an integer greater than 1). Actually, this is backwards: we want to solve the MLE using calculus (because that's our hammer), and in order to add a penalizer term, we also need it to be differentiable. Hence why we historically used the 2-norm.

If we don't solve the optimization problem with calculus, well then we can be more flexible with our penalizer term. I took this liberty when I developed the optimizations in lifetimes, my Python library for recency/frequency analysis. The MLE in the model is too complicated to differentiate, so I use numerical methods to find the minimum (Nelder-Mead to be exact). Because of this, I am free to add any penalizer term I wish, deviating as I choose from the traditional 2-norm. I made the choice of \(\log\), specifically:

$$ \min_{\alpha, \beta} -\ell(\alpha, \beta) + \lambda \left( \log(\alpha) + \log(\beta) \right) $$

- My unknown parameters are strictly positive, so I don't need to worry about taking the log of a non-positive number.
- My parameters are of different scale. This is really important: the two parameters describe different phenomena, and one typically comes out an order of magnitude larger than the other. A 2-norm penalizer term would scale the larger of the two down more than the smaller of the two. (Why? Because the square of a number increases faster the larger the number is, so larger numbers increase the overall penalizer term more.) Instead, if I choose the \(\log\) penalizer term, this does not happen. For example, if I double either the smaller term or the larger term, the effect on the overall penalizer term is the same. So \(\log\) works better when I have parameters on different scales.

This was my logic when I first developed lifetimes. Things were going well, until I started noticing some datasets that would produce unstable convergence *only* when the penalizer coefficient \(\lambda\) was positive: it was driving some variables to nearly 0. How could this be? Shouldn't any positive penalizer coefficient *help* convergence? For this, we'll take two perspectives of this problem.

I probably don't need to say it, but the log of a value less than 1 is negative. More extreme, the log of a *very* small value is *very very* negative (because the rate of change of log near zero gets larger as we approach the asymptote). Thus, during optimization, when a parameter starts to get small, the overall penalizer term starts to gain momentum. In fact, the optimizer starts to shrink a particular parameter to near 0 because that really really helps the overall optimization.

This is obviously not what I wanted. Sure, I wanted to keep values from being too large, but I certainly did not want to reduce parameters to near zero! So it made sense that when I had \(\lambda\) equal to 0 I did not observe this behaviour.

On the other extreme, the \(\log\) penalizer is kinda a terrible penalizer against large values too. An order of magnitude increase in a parameter barely makes a difference in the log of it! It's an extremely sub-linear function, so it doesn't really penalize large parameter sizes well.
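A quick numeric illustration of both failure modes:

```python
import numpy as np

# Near zero, the log penalty rewards shrinking a parameter without bound:
# each order of magnitude smaller makes the penalty term much more negative.
small = np.log([1e-1, 1e-3, 1e-6])   # roughly -2.3, -6.9, -13.8

# For large values, the log barely pushes back: an order-of-magnitude
# increase adds only log(10) ~ 2.3 to the penalty.
large = np.log([10.0, 100.0, 1000.0])   # roughly 2.3, 4.6, 6.9
```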

As noted in a chapter in Bayesian Methods for Hackers, there is a really beautiful and useful relationship between MLE penalizer terms and Bayesian priors. Simply, it comes down to this: the prior is equivalent to the negative exponential of the penalizer term. Thus, the 2-norm penalizer term is a Normal prior on the unknowns; the 1-norm penalizer term is a Laplace prior, and so on. What is then our \(\log\) prior? Well, it's a \(\exp(-\log(\theta)) = \frac{1}{\theta}\) prior on \(\theta\). This is strange, no? It's an improper prior, and it's in fact a Jeffreys prior, so I'm basically saying "I have no idea what this scale parameter should be" - not what I want to be doing.

As much as I like the scale-free property of the \(\log\), it's time to say goodbye to it in favor of another. I think for my purposes, I'll try the Laplace prior/1-norm penalizer as it's a nice balance between disallowing extremely large values and not penalizing the largest parameter too strongly.

Update: I actually went with the square penalizer in the end. Why? The 1-norm was still sending too many values to zero, when I really felt strongly that no value should be zero.

In the '00s, L1 penalties were all the rage in statistics and machine learning. Since they induced sparsity in fitted parameters, they were used as a variable selection method. Today, with some advanced models having tens of *billions* of parameters, sparsity isn't as useful, and the L1 penalty has dropped out of fashion.

However, most teams aren't using billion parameter models, and smart data scientists work with simple models initially. Below is how we implemented an L1 penalty in the Cox regression model.

The log-likelihood we wish to maximize looks like:

$$ll^*(\theta, x) = \sum_i^N ll(\theta, x_i) - \lambda||\theta||_1$$

where \(||\cdot ||_1\) is the sum of absolute values of the parameter vector \(\theta\). With L2 penalties, the penalty term is differentiable, and we can easily find the gradient w.r.t. the parameter vector, which enables us to use iterative solvers to solve this optimization problem. However, when our penalty is L1, we don't have a smooth derivative.

We solve this by replacing \(||\cdot ||_1\) with a smooth version:

$$\text{softabs}(\theta, a) = \frac{1}{a} \log(1 + \exp(-a\theta)) + \frac{1}{a}\log(1 + \exp(a\theta))$$

As \(a \to \infty\), this converges to the absolute value of \(\theta\). This smooth absolute value is differentiable everywhere, too. With that in mind, we can use our iterative solvers again, and increase \(a\) each iteration. I don't think there is any simple way to choose \(a\), but I've found that increasing it exponentially works well.
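A quick numeric check of this convergence (using `numpy.logaddexp` for stability, mirroring the implementation below):

```python
import numpy as np

def soft_abs(x, a):
    # smooth approximation to |x|; tightens as a grows
    return (np.logaddexp(0, -a * x) + np.logaddexp(0, a * x)) / a

# values approach abs(0.5) = 0.5 as a increases
approximations = [soft_abs(0.5, a) for a in (1, 10, 100, 1000)]
```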

We can combine this L1 penalty with an L2 penalty to get the elastic-net penalty, introducing a new parameter \(\rho\) to control the weighting between the two:

$$ll^*(\theta, x, a) = \sum_i^N ll(\theta, x_i) - \lambda\left(\rho \, \text{softabs}(\theta, a) + (1-\rho)||\theta||_2^2\right)$$

And remember: cool kids don't take derivatives by hand! Why waste time and make mistakes trying to compute the first and second derivative of this penalty - let autograd do it!

```
import numpy as np
from autograd import elementwise_grad
from autograd import numpy as anp

def soft_abs(x, a):
    return 1 / a * (anp.logaddexp(0, -a * x) + anp.logaddexp(0, a * x))

def penalizer(beta, a, lambda_, l1_ratio):
    return lambda_ * (l1_ratio * soft_abs(beta, a).sum() + (1 - l1_ratio) * (beta ** 2).sum())

d_penalizer = elementwise_grad(penalizer)
dd_penalizer = elementwise_grad(elementwise_grad(penalizer))

i = 0
while converging:
    i += 1
    h, g, ll = get_gradients(X, beta)
    # increase a exponentially each iteration so softabs -> abs
    ll -= penalizer(beta, 1.3 ** i, 0.1, 0.5)
    g -= d_penalizer(beta, 1.3 ** i, 0.1, 0.5)
    h[np.diag_indices(d)] -= dd_penalizer(beta, 1.3 ** i, 0.1, 0.5)
    # update beta and converging logic! Not shown here.
```

In lifelines 0.24+, we introduced the L1 penalty for Cox models. We can visualize the sparsity effect of the L1 penalty as we increase the `penalizer` term:

```
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi

rossi = load_rossi()

results = {}
for p in np.linspace(0.001, 0.2, 40):
    cph = CoxPHFitter(l1_ratio=1., penalizer=p).fit(rossi, "week", "arrest")
    results[p] = cph.params_

pd.DataFrame(results).T.plot()
```

The first model, `CoxTimeVaryingFitter`, is used for time-varying datasets. Time-varying datasets require a more complicated algorithm, one that works by iterating over all unique times and "pulling out" the relevant rows associated with that time. It's a slow algorithm, as it requires lots of Python/Numpy indexing, which gets worse as the dataset size grows. Call this the *batch* algorithm.

The second model, `CoxPHFitter`, is used for static datasets. It uses a simpler algorithm that iterates over every row once only, and requires minimal indexing. Call this the *single* algorithm.

The strange behaviour I noticed was that, for my benchmark static dataset, the more-complicated `CoxTimeVaryingFitter` was actually *faster* than my simpler `CoxPHFitter` model. Like almost twice as fast.

The thought occurred to me that iterating over all unique times has an advantage versus iterating over all rows *when the cardinality of times is small relative to the number of rows*. That is, when there are lots of ties in the dataset, our batch algorithm should perform faster. In one extreme limit, when there is no variation in times, the batch algorithm will perform a single loop and finish. However, in the other extreme limit where all times are unique, our batch algorithm does expensive indexing too often.

This means that the algorithms will perform differently on different datasets (though the returned results should still be identical). And the magnitude of the performance difference is like a factor of 2, if not more. But given a dataset at runtime, how can I know which algorithm to choose? It would be unwise to let the untrained user decide - they should be abstracted away from this decision.

As a first step, I wanted to know *how* often the batch algorithm outperformed the single algorithm. To do this, I needed to create datasets with different sizes and varying fractions of tied times. For the latter characteristic, I defined the (very naive) statistic as:

$$\text{n unique times} = \text{frac}\; \cdot \; \text{n rows} $$
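Here's a sketch of how such a dataset could be generated (the helper `durations_with_tie_fraction` is my own illustration, not the actual benchmark code):

```python
import numpy as np

def durations_with_tie_fraction(n_rows, frac, rng=None):
    """Sample n_rows durations with at most frac * n_rows unique values,
    so a low frac means heavy ties."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_unique = max(1, int(frac * n_rows))
    # draw a small support of distinct times, then resample with replacement
    support = rng.exponential(10, size=n_unique)
    return rng.choice(support, size=n_rows)

T = durations_with_tie_fraction(432, 0.01)  # only a handful of unique times
```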

Once dataset creation was done, I generated many combinations and timed the performance of both the batch algorithm (now natively ported to `CoxPHFitter`) and the single algorithm. A sample of the output is below (the time units are seconds).

| N | frac | batch | single |
| --- | --- | --- | --- |
| 432 | 0.010 | 0.175 | 0.249 |
| 432 | 0.099 | 0.139 | 0.189 |
| 432 | 0.189 | 0.187 | 0.198 |
| ... | ... | ... | ... |

So, for 432 rows and a very high number of ties (i.e. a low count of unique times), we see batch algorithm performance of 0.18 seconds vs. 0.25 seconds for the single algorithm. Since I'm only comparing two algorithms and I'm interested in the faster one, the *ratio* of batch performance to single performance is just as meaningful. Here's all the raw data.

Looking at the raw data, it's not clear what the relationship is between N, frac, and ratio. Let's plot the data instead:

What we can see is that the ratio variable increases *almost* linearly with N and frac. Almost linearly. At this point, one idea is to compute both statistics (N, frac) for a given dataset at runtime (it's cheap), and *predict* what its ratio is going to be. If that value is above 1, use the single algorithm; else, use the batch algorithm.

We can run a simple linear regression against the dataset of runtimes, and we get the following:

```
import statsmodels.api as sm
X = results[["N", "frac"]]
X = sm.add_constant(X)
Y = results["ratio"]
model = sm.OLS(Y, X).fit()
print(model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: ratio R-squared: 0.933
Model: OLS Adj. R-squared: 0.931
Method: Least Squares F-statistic: 725.8
Date: Thu, 03 Jan 2019 Prob (F-statistic): 3.34e-62
Time: 13:11:37 Log-Likelihood: 68.819
No. Observations: 108 AIC: -131.6
Df Residuals: 105 BIC: -123.6
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 0.3396 0.030 11.367 0.000 0.280 0.399
N 4.135e-05 4.63e-06 8.922 0.000 3.22e-05 5.05e-05
frac 1.5038 0.041 37.039 0.000 1.423 1.584
==============================================================================
Omnibus: 3.371 Durbin-Watson: 1.064
Prob(Omnibus): 0.185 Jarque-Bera (JB): 3.284
Skew: -0.166 Prob(JB): 0.194
Kurtosis: 3.787 Cond. No. 1.77e+04
==============================================================================
```

Plotting this *plane-of-best-fit*:

We can see that the fit is pretty good, but there are some non-linearities in the second figure that we aren't capturing. We should expect non-linearities, too: in the batch algorithm, the average batch size is (N * frac) data points, so this interaction should be a factor in the batch algorithm's performance. Let's include that interaction in our regression:

```
import statsmodels.api as sm

results["N * frac"] = results["N"] * results["frac"]
X = results[["N", "frac", "N * frac"]]
X = sm.add_constant(X)
Y = results["ratio"]

model = sm.OLS(Y, X).fit()
print(model.summary())
```

```
                            OLS Regression Results
==============================================================================
Dep. Variable:                  ratio   R-squared:                       0.965
Model:                            OLS   Adj. R-squared:                  0.964
Method:                 Least Squares   F-statistic:                     944.4
Date:                Thu, 03 Jan 2019   Prob (F-statistic):           2.89e-75
Time:                        13:16:48   Log-Likelihood:                 103.62
No. Observations:                 108   AIC:                            -199.2
Df Residuals:                     104   BIC:                            -188.5
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.5465      0.030     17.941      0.000       0.486       0.607
N          -1.187e-05   6.44e-06     -1.843      0.068   -2.46e-05    9.05e-07
frac           1.0899      0.052     21.003      0.000       0.987       1.193
N * frac       0.0001    1.1e-05      9.702      0.000    8.47e-05       0.000
==============================================================================
Omnibus:                       10.775   Durbin-Watson:                   1.541
Prob(Omnibus):                  0.005   Jarque-Bera (JB):               21.809
Skew:                          -0.305   Prob(JB):                     1.84e-05
Kurtosis:                       5.115   Cond. No.                     3.43e+04
==============================================================================
```

Looks like we capture more of the variance (\(R^2\)), and the visual fit looks better, too! So where are we?

- Given a dataset, I can compute its statistics, N and frac, at runtime.
- I enter these into my linear model (with an interaction term), and it predicts the ratio of batch performance to single performance.
- If this prediction is greater than 1.0, I choose single; else, batch.
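Concretely, the decision rule might look like the following sketch. The coefficients are hand-copied (and rounded) from the interaction-model summary above, so treat them as illustrative:

```
def predict_ratio(N, frac):
    # rounded coefficients from the OLS-with-interaction summary above
    return 0.5465 - 1.187e-5 * N + 1.0899 * frac + 1.0e-4 * (N * frac)

def choose_algorithm(N, frac):
    # a predicted ratio above 1 means batch is expected to be slower
    return "single" if predict_ratio(N, frac) > 1.0 else "batch"
```

For example, a dataset with a high frac pushes the prediction above 1 and selects the single algorithm.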

This idea was recently implemented in *lifelines*, and with some other optimizations to the batch algorithm, we see a 60% speedup on some datasets!

- This is a binary problem, batch vs. single, so why not use logistic regression? Frank Harrell, one of the greatest statisticians, would not be happy with that. He advocates against false dichotomies in statistics (ex: rate > some arbitrary threshold, unnecessary binning of continuous variables, etc.). This is a case of that: we should be modelling the ratio, not the sign. In doing so, I retain maximum information for the algorithm to use.
- More than 2 algorithms to choose from? Instead of modelling the ratio, I can model the performance of each algorithm, and choose the algorithm associated with the smallest prediction.
- L1, L2 penalizers? Cross-validation? IDGAF. These give me minimal gains. Adding the interaction term was borderline going overboard.
- There is an asymmetric cost of being wrong that I would like to model. Choosing the batch algorithm incorrectly could have much worse consequences for performance than choosing the single algorithm incorrectly. To model this, instead of using a linear model with squared loss, I could use a quantile regression model with asymmetric loss.

¹ ugh, naming is hard.

The bacterium *C. botulinum* is responsible for creating one of the most dangerous chemicals known to man: botulinum toxin. If ingested, incredibly small amounts of this toxin can kill even a healthy person. Thankfully, food scientists and microbiologists have developed ways to control *C. botulinum*. Applying high acidity, high salinity, low or high temperature, or oxygen exposure can slow the growth of the bacteria, and in extreme amounts, destroy it. For food purposes, however, it is sufficient to slow the growth of the bacteria, typically by extending its *lag period* to an extremely long time. The lag period is the phase in which a microorganism becomes accustomed to its new environment before starting cell division (and entering the *exponential phase* of growth) - see figure below. For example, the reason vinegar is used in home canning is to create an environment that is unfavourable (too acidic) for any remaining microbes (and there always are some), so they won't start multiplying. Vinegar isn't necessarily killing the bacteria, just extending its lag phase. Improper acidification in home canning is actually the cause of most botulism outbreaks in the Western world.

The idea of hurdle technology is to create more than one unfavourable growth condition for bacteria (or their spores). The idea is that either the conditions have a super-additive negative effect on growth, or that one condition can be "relaxed" to improve sensory characteristics (ex: less vinegar, but more salt) while still achieving the desired control.

Hurdle technology is present in almost all foods, typically disguised as some preservation technique. Consider cheese: it has low water activity (moisture), is stored at cool temperatures, and has high acidity. All of these conditions are hurdles.

One concern I often see beginner fermenters worry about is botulism. This is understandable: it's somewhat uncomfortable leaving a jar of vegetables out of the fridge for weeks, and then eating it. This goes against everything we have been taught about food safety. And though botulism is rare, it is deadly enough that the expected risk is high enough to cause worry.

However, hurdle technology is at play here. The idea is to use multiple hurdles, in this case salt, acid, and enough competing microbes, to effectively pause any botulism bacteria in their lag period. The salt is added in the brine, and the acid can be provided by the lactic acid bacteria or added manually to the brine. For brines below 6% NaCl, which covers almost all fermentation brines, the lactic acid bacteria have essentially no lag period and very quickly enter their exponential phase [1]. However, *C. botulinum* can survive moderate amounts of salt as well, and if present, could also start multiplying. What we'd like is to create an environment that extends the lag period of *C. botulinum* far enough that the lactic acid bacteria can completely outcompete any *C. botulinum* present. Let's get some data.

The data comes from a 1983 paper on the effects of salt concentration and pH on *C. botulinum* [2]. Below is a screenshot of the conditions of the trials (first and second columns) and the observed lag period (third column).

We can see that we actually don't have many *exact* observations. Most observations are `<1`, meaning that the authors didn't observe the lag period end exactly, but instead noted that it ended sometime within the first day. Similarly, for some pH and salt concentrations, the lag period extended past the authors' 85-day observation window, so they recorded this as `>85`. This type of data is censored, so let's use survival analysis to analyze the relationship between pH, salt, and lag periods.

We are interested in determining the survival distribution of the lag period, conditional on pH and salt concentration. Don't get confused about the "survival" part here: we are not measuring the survival of bacteria - we are using survival analysis, a technique for duration data, to model how long the bacteria stays in its lag period. We have data that is both right-censored (`>85`) and left-censored (`<1`). We can encode this data using two columns, one for a lower bound (which may be 0) and one for an upper bound (which may be infinity). For exact measurements, the lower and upper bounds are equal:

```
from lifelines.datasets import load_c_botulinum_lag_phase
df = load_c_botulinum_lag_phase()
print(df)
```

```
    NaCl %   pH  lower_bound_days  upper_bound_days
0        0  7.0               0.0               1.0
1        0  6.5               0.0               1.0
2        0  6.0               0.0               1.0
3        0  5.5               0.0               1.0
4        0  5.0               2.0               2.0
5        2  7.0               0.0               1.0
6        2  6.5               0.0               1.0
7        2  6.0               0.0               1.0
8        2  5.5               0.0               1.0
9        2  5.0              85.0               inf
10       3  7.0               0.0               1.0
11       3  6.5               0.0               1.0
12       3  6.0               0.0               1.0
13       3  5.5               2.0               2.0
14       3  5.0              85.0               inf
15       4  7.0               2.0               2.0
16       4  6.5               2.0               2.0
17       4  6.0               3.0               3.0
18       4  5.5               7.0               7.0
19       4  5.0              85.0               inf
20       6  7.0              85.0               inf
21       6  6.5              85.0               inf
22       6  6.0              85.0               inf
23       6  5.5              85.0               inf
24       6  5.0              85.0               inf
```
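As an aside, converting the paper's raw `<1` / `>85` notation into these bound columns could be sketched like this (`to_bounds` is a hypothetical helper for illustration, not a *lifelines* function):

```
import numpy as np
import pandas as pd

# hypothetical raw strings, as they appear in the paper's table
raw = pd.Series(["<1", "2", ">85"])

def to_bounds(obs):
    if obs.startswith("<"):       # left-censored: ended sometime before this
        return 0.0, float(obs[1:])
    if obs.startswith(">"):       # right-censored: still ongoing at study end
        return float(obs[1:]), np.inf
    t = float(obs)                # exact observation: bounds coincide
    return t, t

bounds = pd.DataFrame([to_bounds(o) for o in raw],
                      columns=["lower_bound_days", "upper_bound_days"])
print(bounds)
```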

Let's use a Weibull survival regression model. That is, our functional form looks like:

$$ \begin{align}&P(\text{lag period} > t) \\ &= S(t\;|\;\text{pH},\text{salt%})\\ &= \exp{\left(-\left(\frac{t}{\lambda(\text{pH, salt%})}\right)^\rho\right)} \end{align}$$

where

$$ \lambda(\text{pH, salt%}) =\exp{(\beta_1\text{pH} + \beta_2\text{salt%} + \beta_0)} $$

The coefficients and \(\rho\) are to be estimated from the data. Fitting is done in *lifelines*:

```
from lifelines import WeibullAFTFitter

aft = WeibullAFTFitter()
aft.fit_interval_censoring(
    df,
    lower_bound_col="lower_bound_days",
    upper_bound_col="upper_bound_days",
)
aft.print_summary()
```

```
<lifelines.WeibullAFTFitter: fitted with 25 total observations, 19 interval-censored observations>
          lower bound col = 'lower_bound_days'
          upper bound col = 'upper_bound_days'
                event col = 'E_lifelines_added'
   number of observations = 25
number of events observed = 6
           log-likelihood = -25.70
         time fit was run = 2020-04-12 13:20:30 UTC
---
                      coef  exp(coef)  se(coef)  coef lower 95%  coef upper 95%  exp(coef) lower 95%  exp(coef) upper 95%
lambda_ NaCl %        2.51      12.27      0.69            1.16            3.86                 3.18                47.43
        pH           -4.87       0.01      1.48           -7.77           -1.98                 0.00                 0.14
        _intercept   23.38   1.43e+10      7.16            9.35           37.41             11520.08             1.77e+16
rho_    _intercept   -0.73       0.48      0.35           -1.41           -0.05                 0.24                 0.95

                        z      p  -log2(p)
lambda_ NaCl %       3.64 <0.005     11.81
        pH          -3.30 <0.005     10.01
        _intercept   3.27 <0.005      9.84
rho_    _intercept  -2.11   0.04      4.83
---
Log-likelihood ratio test = 31.94 on 2 df, -log2(p)=23.04
```

Looking at the `coef` column above, we can see that a lower pH increases the lag period, and a higher salt concentration increases the lag period. If we suspect super-additive effects between salt and pH, we can try adding an interaction term. After doing so, the fit isn't much improved, so we leave it out.

Aside: I found a handful of bugs (convergence errors, API mistakes) when doing this project, which is great, because it makes *lifelines* more robust for other users.

Now that we can connect pH and salt concentration to the probability of botulism growth, we can build a rule that minimizes a fermentation's risk of botulism. Specifically, if we give the lactic acid bacteria (which have essentially no lag period) a sufficient head start, we can rest assured they will outcompete the bad bacteria and further acidify the environment. Let's say we want the probability of the *C. botulinum* lag period ending within the first 6 hours (0.25 days) to be less than 1% - 6 hours is more than enough of a head start for the good bacteria. In math:

$$ \begin{align} & P(\text{lag period} < 0.25) < 0.01\\ & \iff P(\text{lag period} > 0.25) > 0.99 \\ & \iff S(0.25) > 0.99 \\ & \iff S(0.25) = \exp{\left(-\left(\frac{0.25}{\lambda(\text{pH, salt%})}\right)^\rho\right)} > 0.99 \\ & \iff \exp{\left(-\left(\frac{0.25}{\exp{(2.51 \cdot \text{salt%} -4.87 \cdot \text{pH} + 23.38)}}\right)^{0.48}\right)} > 0.99 \\ \end{align} $$

Performing some algebra, and some rounding to make the final formula nicer, the above inequality is satisfied if:

$$2 \cdot \text{pH} - \text{salt%} < 9$$
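To sanity-check a particular brine, one can plug the rounded coefficients from the summary back into the Weibull survival function directly. A sketch (since the coefficient values are rounded, treat the output as approximate):

```
import numpy as np

def lag_survival(t, pH, salt):
    # rounded coefficients from the fitted WeibullAFTFitter summary above
    lam = np.exp(2.51 * salt - 4.87 * pH + 23.38)
    rho = 0.48
    return np.exp(-((t / lam) ** rho))

# lower pH and higher salt should both lengthen the lag period,
# i.e. raise P(lag period > 6 hours)
print(lag_survival(0.25, pH=5.0, salt=2))
print(lag_survival(0.25, pH=7.0, salt=2))
```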

If your initial brine satisfies this inequality, you can be quite certain that the risk of botulism is nil. For example, if you want to decrease the salt concentration by one percentage point, you should lower the pH by half a unit. Keep in mind, this is a very conservative rule, and even if the inequality is not satisfied, that does not mean botulism will develop. This formula can be used by the most worried of individuals. There are of course many other ways to reduce the risk of botulism, but that's for a non-Data Origami article!

TLDR: upgrade lifelines for lots of improvements

`pip install -U lifelines`

During my time off, I’ve spent a lot of time improving my side projects so I’m at least *kinda* proud of them. I think lifelines, my survival analysis library, is in that spot. I’m actually kinda proud of it now.

A lot has changed in lifelines in the past few months, and in this post I want to mention some of the biggest additions and the stories behind them.

The Cox proportional hazards model is the workhorse of survival analysis. Almost all papers, unless there is good reason not to, use the Cox model. This was one of the first regression models added to lifelines, but it has always been too slow. It's implemented in NumPy, but there was a tricky `for` loop still in Python. I had ideas on how to turn that loop into a vectorized NumPy operation of matrix products, but there would have been an intermediate variable that created a `d x d x p` tensor, where `d` is the number of independent covariates and `p` is the size of some subset of subjects (at worst, `p` could equal the number of subjects). This would quickly explode the amount of memory required, and hence performance would degrade.

One night, on Twitter, I noticed some posts about how to use `einsum` in PyTorch and NumPy. I had previously heard of it, but didn't think it was implemented in NumPy, nor did I think it was something I could use. Turns out, `einsum` is a way to do matrix multiplication *without* intermediate variables! How? Since a product of matrices is just multiplications and sums, one can declaratively define what the end product should look like, and the internals of `einsum` will compute the intermediate operations at the C layer (I'm simplifying, and my own understanding of this is shaky). Suffice to say, after racking my brain and lots of trial and error with `einsum`, I could replace the tricky Python `for` loop with `einsum`! This resulted in a 3x performance increase. However, what I gained in performance, I lost in some readability of my code. I've put some references to `einsum` at the bottom of this article.
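To make the trick concrete, here is a small sketch of the kind of intermediate-free contraction `einsum` enables (toy shapes and data, not the actual Cox internals):

```
import numpy as np

p, d = 1000, 5           # subjects and covariates (toy sizes)
X = np.random.randn(p, d)
w = np.random.rand(p)    # per-subject weights

# naive: materialize a (p, d, d) tensor of weighted outer products, then sum
naive = (w[:, None, None] * X[:, :, None] * X[:, None, :]).sum(axis=0)

# einsum: declare the output indices; the (p, d, d) tensor is never built
fast = np.einsum("p,pi,pj->ij", w, X, X)

assert np.allclose(naive, fast)
```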

The second significant improvement to the performance of the Cox model is a meta-algorithm that selects the fastest algorithm. There are two ways to compute the likelihood, and the performance of each is highly dependent on a few characteristics of the dataset, notably how many ties are in the dataset. After running some tests, I noticed that the delineation of when one was faster than the other was not clear, so a heuristic like `if num_ties > 10:` would not be very effective. What I did instead was generate hundreds of artificial datasets of varying ties and varying sizes, and measure the timing of both algorithms. I then fit a linear model to the *ratio* of the times, conditioned on the ties and size (and their interaction). It was a surprisingly good fit! So now in lifelines, at runtime, I compute some statistics about the incoming dataset, plug these values into the fitted linear model, and the result is the predicted ratio of timings between the two algorithms. I then choose whichever algorithm is predicted to be faster and continue on. The prediction is super fast (it's a linear model, after all), so there is no performance hit there. With this meta-algorithm, the lifelines Cox implementation is up to 3x faster for some datasets. I wrote up a full summary of the idea in a previous blog post [2].

Overall, the Cox model is now 10x faster than it was a few months ago. (Also, I'll only mention it here, but the Aalen additive model is about 50x faster; most of those speed improvements were from replacing Pandas with NumPy at critical points.)

One large gap in lifelines was checking the proportional hazards assumption, which is critical for any kind of inference-focused modeling (it matters less for prediction tasks). The author of the popular R survival library, Terry Therneau, has made massive contributions to survival analysis techniques, including a statistical test for non-proportionality. This test relies on an important residual of the Cox regression. While implementing this statistical test in lifelines, I realized there was a more general solution for handling *all* residuals, so I added functionality to compute the most common residuals.

These additions enabled a new, very user-friendly function, `check_assumptions`, which prints out potential proportionality violations in a human-readable format and offers advice on how to fix them. I also introduced residual plots:

I think too many people are focused on deep learning. There, I said it, and you probably agree. However, some really cool technologies are falling out of that area that others can use. One of them is libraries that implement *automatic differentiation*, aka *autodiff*. This is like the holy grail for computational statisticians. Let me quickly explain why: given an arbitrary numerical function, you can automatically compute its exact gradient at any (valid) point. No rounding errors. No restrictions on the function. Just gradients. To quote directly from [1]:

> Q: What's the difference between autodiff and symbolic diff?
>
> R: They are totally different. The biggest difference is that autodiff can differentiate algorithms, not just expressions. Consider the following code:
>
> ```
> function f(x)
>   y = x;
>   for i = 1..100
>     y = sin(x + y);
>   return y
> ```
>
> Automatic differentiation can differentiate that, easily, in the same time as the original code. Symbolic differentiation would lead to a huge expression that would take much more time to compute.
>
> Q: What about non-differentiable functions?
>
> R: No problem, as long as the function is differentiable at the place you try to compute the gradient.

So why am I so excited about this? I suffered through a week of frustrations and headaches trying to implement a log-normal survival model by hand. You can see my frustration here [3]. Also, my second-derivative calculations were abysmal, which meant we would be computing unreliable confidence intervals. The whole thing made me depressed. I was pointed to the Python library *autograd* [4], and after some wrestling, it was like a beam of heaven shone down on me. Computing gradients with autograd is easy, computing second derivatives is easy, and performance is near identical. It opened up future generalizations and abstractions, and it has radically simplified my code base. One cool idea: since we know the second derivative exactly, we can compute the variance matrix of the fitted parameters exactly, and we can use the delta method (and autograd) to compute variances of arbitrary functions of those fitted parameters. Here are two worked examples of the cool things you can do in lifelines now:

Autograd also enabled lifelines to implement accelerated failure time models, so users now have three new regression models to play with. I’ve been so happy with autograd that I’ve converted parts of my other Python library, lifetimes, to use it as well.

I’ve given the lifelines docs a serious facelift, probably doubled the amount of content, and edited large parts of it. Overall, I am much happier with the docs now. One addition I made was tracking visitors’ searches on the docs site. This gives me some idea of where users might be confused, or where the current docs structure is insufficient.

It's been a productive few months, and I think lifelines is in a good state. More of my attention now is on lifetimes (a new version was just released, by the way) and some other topics. Hope you enjoy lifelines!

Einsum references:

- https://obilaniu6266h16.wordpress.com/2016/02/04/einstein-summation-in-numpy/
- https://rockt.github.io/2018/04/30/einsum

[3] https://github.com/CamDavidsonPilon/lifelines/issues/622
