# Highlights from lifelines v0.25.0

Today, the 0.25.0 release of *lifelines* was released. I'm very excited about some changes in this version, and want to highlight a few of them. Be sure to upgrade with:

`pip install lifelines==0.25.0`

### Formulas everywhere!

Formulas, which should really be called Wilkinson-style notation but everyone just calls them formulas, is a lightweight-grammar for describing additive relationships. If you have used R, you'll likely be familiar with formulas. They are less common in Python, so here's an example: Writing `age + salary`

is short-form for the expanded additive model: \(\beta_0 + \beta_1\text{age} + \beta_2\text{salary}\). Expanding on that, the grammar allows interaction terms quite easily: `age + salary + age : salary`

is

$$ \beta_0 + \beta_1\text{age} + \beta_2\text{salary} + \beta_3 \text{age $\cdot$ salary}$$

Actually, the previous use is so common that a short form of this entire interaction and monomial terms exists: `age * salary`

. These are just examples, but there is a large set of transformations, ways to handle categorical variables, etc.

This is just the grammar though, and a compiler is needed to parse the string, and then translate the parsed string into code. This code can transform an initial dataset into its transformed version that allows the \(\beta\)s to be estimated. For example, transforming the following raw dataset using the formula `age * salary`

:

```
df = pd.DataFrame({
'age': [35, 36, 40, 25, 55],
'salary': [60, 35, 80, 50, 100]
})
```

becomes:

```
pd.DataFrame({
'age': [35, 36, 40, 25, 55],
'salary': [60, 35, 80, 50, 100],
'age:salary': [2100, 1260, 2400, 1750, 5500]
'Intercept': [1, 1, 1, 1, 1]
})
```

This new dataframe can be given to any regression library to fit the \(\beta\)s. In Python, libraries like Patsy and the new Formulaic are the parser + code-generator.

Anyways, *lifelines* previously requested that all transformations occur in a preprocessing step, and the final dataframe given to a *lifelines* model. This created some problems, however:

- The user had to learn Patsy in order to use formulas in
*lifelines*, which is a barrier to entry.

- In methods like
`plot_covariate_groups`

which relied on examining potentially more than one variable, the user had to manually recreate potential interaction terms or more complicated transformations.

- Users often had to
`drop`

columns from their dataframe prior to fitting with*lifelines*, rather than telling*lifelines*what they wanted to use in the regression. This led to some ugly code.

With *lifelines* v0.25.0, formulas are now native (though optional) to *lifelines* model (old code should still work as well):

```
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi
rossi = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi, "week", "arrest", formula="age + fin + prio + paro * mar")
cph.print_summary(columns=['coef', 'se(coef)', '-log2(p)'])
"""
<lifelines.CoxPHFitter: fitted with 432 total observations, 318 right-censored observations>
duration col = 'week'
event col = 'arrest'
baseline estimation = breslow
number of observations = 432
number of events observed = 114
partial log-likelihood = -659.49
time fit was run = 2020-07-27 15:33:44 UTC
---
coef se(coef) -log2(p)
covariate
age -0.06 0.02 8.21
fin -0.37 0.19 4.19
prio 0.10 0.03 10.96
paro -0.10 0.20 0.73
mar -0.78 0.73 1.81
paro:mar 0.36 0.84 0.57
---
Concordance = 0.63
Partial AIC = 1330.98
log-likelihood ratio test = 31.78 on 6 df
-log2(p) of ll-ratio test = 15.76
"""
```

However, the *real* strength in formulas is their ability to create *basis splines* easily. Basis splines are highly flexible non-linear transformations of a variable - they are essential in the modern statistical inference toolkit:

```
cph.fit(rossi, "week", "arrest", formula="age + fin + bs(prio, df=3)")
cph.print_summary(columns=['coef', 'se(coef)', '-log2(p)'])
"""
<lifelines.CoxPHFitter: fitted with 432 total observations, 318 right-censored observations>
duration col = 'week'
event col = 'arrest'
baseline estimation = breslow
number of observations = 432
number of events observed = 114
partial log-likelihood = -659.88
time fit was run = 2020-07-27 15:36:34 UTC
---
coef se(coef) -log2(p)
covariate
age -0.07 0.02 9.91
fin -0.32 0.19 3.49
bs(prio, df=3)[0] 1.41 0.96 2.82
bs(prio, df=3)[1] -0.18 1.02 0.22
bs(prio, df=3)[2] 2.82 0.81 11.06
---
Concordance = 0.63
Partial AIC = 1329.76
log-likelihood ratio test = 31.00 on 5 df
-log2(p) of ll-ratio test = 16.70
"""
```

Importantly, the new transform logic in *lifelines* is extended to the `predict`

and plotting methods too.

For models that have more than one parameter, like `WeibullAFTFitter`

, formulas can be crafted for each parameter.

```
from lifelines import WeibullAFTFitter
wf = WeibullAFTFitter()
wf.fit(rossi, "week", "arrest", formula="age + fin + paro * mar", ancillary="age * fin")
wf.print_summary(columns=['coef', 'se(coef)', '-log2(p)'])
"""
<lifelines.WeibullAFTFitter: fitted with 432 total observations, 318 right-censored observations>
duration col = 'week'
event col = 'arrest'
number of observations = 432
number of events observed = 114
log-likelihood = -681.82
time fit was run = 2020-07-27 16:49:31 UTC
---
coef se(coef) -log2(p)
param covariate
lambda_ Intercept 2.19 0.68 9.59
age 0.11 0.03 10.13
fin 0.12 0.19 0.97
paro 0.17 0.14 2.19
mar 0.33 0.53 0.93
paro:mar -0.14 0.61 0.29
rho_ Intercept 1.36 0.41 9.98
age -0.05 0.02 7.25
fin -0.38 0.54 1.04
age:fin 0.02 0.02 1.74
---
Concordance = 0.62
AIC = 1383.65
log-likelihood ratio test = 29.60 on 8 df
-log2(p) of ll-ratio test = 11.97
"""
```

### New KMunicate plots!

One of my favourite types of research articles are on changes to improving scientific understanding and communication. Last year, a paper, by T. Morris *et al*., came out that surveyed statisticians, clinicians and stakeholders on how to better communicate the workhorse of survival analysis, Kaplan-Meier (KM) curves. A number of potential options were shown to the over 1100 survey participants, and ratings to each option were made. The survey results show two changes that could be made to improve understanding of KM curves.

1. Always show confidence intervals around the curves (*lifelines* does this by default)

2. Present all the summary information at the bottom. Many KM curves would present a*t-risk* numbers, and *lifelines* had this option as well:

But the participants really liked *all* summary information, including deaths and censorship, not just at-risk. *lifelines* now presents all this information:

It's not displayed by default (that may change), but with the `at_risk_counts`

kwarg in the call to `KaplanMeierFitter.plot`

.

This small change has the potential to be a massive improvement in understanding these plots. I'm so happy I saw this paper come across my desk.

### Performance improvements

Performance improvements was actually part of a release a few weeks back and not the v0.25.0, but I wanted to highlight it anyways. I found a bug in the switching algorithm that we use for choosing which algorithm to run for the Cox model (see post on how that's done here). This bug made the choice of algorithm sub-optimal, specifically for very large datasets. After fixing the bug, `CoxPHFitter`

is able to process millions of rows in less than a tenth of a second (not to trash on it, but just for comparison: R takes on the order of seconds, and the core algorithm is written in C). This is probably one of the fastest Cox-Efron models, and it's written only in Python. This makes me really proud.

### Conclusion

I'm really happy with where this release landed. There were lots of ups and downs in code quality and structure, but I feel like I settled on something nice. Formulas are a big deal, and will take *lifelines* to the next level. Future release will have support for partial-effect plots, and more.

You can see all the improvements, and important API changes, in v0.25.0 here.