Highlights from lifelines v0.25.0
Today, the 0.25.0 release of lifelines was released. I'm very excited about some changes in this version, and want to highlight a few of them. Be sure to upgrade with:
pip install lifelines==0.25.0
Formulas everywhere!
Formulas, which should really be called Wilkinson-style notation but everyone just calls them formulas, is a lightweight-grammar for describing additive relationships. If you have used R, you'll likely be familiar with formulas. They are less common in Python, so here's an example: Writing age + salary
is short-form for the expanded additive model: \(\beta_0 + \beta_1\text{age} + \beta_2\text{salary}\). Expanding on that, the grammar allows interaction terms quite easily: age + salary + age : salary
is
$$ \beta_0 + \beta_1\text{age} + \beta_2\text{salary} + \beta_3 \text{age $\cdot$ salary}$$
Actually, the previous use is so common that a short form of this entire interaction and monomial terms exists: age * salary
. These are just examples, but there is a large set of transformations, ways to handle categorical variables, etc.
This is just the grammar though, and a compiler is needed to parse the string, and then translate the parsed string into code. This code can transform an initial dataset into its transformed version that allows the \(\beta\)s to be estimated. For example, transforming the following raw dataset using the formula age * salary
:
df = pd.DataFrame({
'age': [35, 36, 40, 25, 55],
'salary': [60, 35, 80, 50, 100]
})
becomes:
pd.DataFrame({
'age': [35, 36, 40, 25, 55],
'salary': [60, 35, 80, 50, 100],
'age:salary': [2100, 1260, 2400, 1750, 5500]
'Intercept': [1, 1, 1, 1, 1]
})
This new dataframe can be given to any regression library to fit the \(\beta\)s. In Python, libraries like Patsy and the new Formulaic are the parser + code-generator.
Anyways, lifelines previously requested that all transformations occur in a preprocessing step, and the final dataframe given to a lifelines model. This created some problems, however:
- The user had to learn Patsy in order to use formulas in lifelines, which is a barrier to entry.
- In methods like
plot_covariate_groups
which relied on examining potentially more than one variable, the user had to manually recreate potential interaction terms or more complicated transformations.
- Users often had to
drop
columns from their dataframe prior to fitting with lifelines, rather than telling lifelines what they wanted to use in the regression. This led to some ugly code.
With lifelines v0.25.0, formulas are now native (though optional) to lifelines model (old code should still work as well):
from lifelines import CoxPHFitter
from lifelines.datasets import load_rossi
rossi = load_rossi()
cph = CoxPHFitter()
cph.fit(rossi, "week", "arrest", formula="age + fin + prio + paro * mar")
cph.print_summary(columns=['coef', 'se(coef)', '-log2(p)'])
"""
<lifelines.CoxPHFitter: fitted with 432 total observations, 318 right-censored observations>
duration col = 'week'
event col = 'arrest'
baseline estimation = breslow
number of observations = 432
number of events observed = 114
partial log-likelihood = -659.49
time fit was run = 2020-07-27 15:33:44 UTC
---
coef se(coef) -log2(p)
covariate
age -0.06 0.02 8.21
fin -0.37 0.19 4.19
prio 0.10 0.03 10.96
paro -0.10 0.20 0.73
mar -0.78 0.73 1.81
paro:mar 0.36 0.84 0.57
---
Concordance = 0.63
Partial AIC = 1330.98
log-likelihood ratio test = 31.78 on 6 df
-log2(p) of ll-ratio test = 15.76
"""
However, the real strength in formulas is their ability to create basis splines easily. Basis splines are highly flexible non-linear transformations of a variable - they are essential in the modern statistical inference toolkit:
cph.fit(rossi, "week", "arrest", formula="age + fin + bs(prio, df=3)")
cph.print_summary(columns=['coef', 'se(coef)', '-log2(p)'])
"""
<lifelines.CoxPHFitter: fitted with 432 total observations, 318 right-censored observations>
duration col = 'week'
event col = 'arrest'
baseline estimation = breslow
number of observations = 432
number of events observed = 114
partial log-likelihood = -659.88
time fit was run = 2020-07-27 15:36:34 UTC
---
coef se(coef) -log2(p)
covariate
age -0.07 0.02 9.91
fin -0.32 0.19 3.49
bs(prio, df=3)[0] 1.41 0.96 2.82
bs(prio, df=3)[1] -0.18 1.02 0.22
bs(prio, df=3)[2] 2.82 0.81 11.06
---
Concordance = 0.63
Partial AIC = 1329.76
log-likelihood ratio test = 31.00 on 5 df
-log2(p) of ll-ratio test = 16.70
"""
Importantly, the new transform logic in lifelines is extended to the predict
and plotting methods too.
For models that have more than one parameter, like WeibullAFTFitter
, formulas can be crafted for each parameter.
from lifelines import WeibullAFTFitter
wf = WeibullAFTFitter()
wf.fit(rossi, "week", "arrest", formula="age + fin + paro * mar", ancillary="age * fin")
wf.print_summary(columns=['coef', 'se(coef)', '-log2(p)'])
"""
<lifelines.WeibullAFTFitter: fitted with 432 total observations, 318 right-censored observations>
duration col = 'week'
event col = 'arrest'
number of observations = 432
number of events observed = 114
log-likelihood = -681.82
time fit was run = 2020-07-27 16:49:31 UTC
---
coef se(coef) -log2(p)
param covariate
lambda_ Intercept 2.19 0.68 9.59
age 0.11 0.03 10.13
fin 0.12 0.19 0.97
paro 0.17 0.14 2.19
mar 0.33 0.53 0.93
paro:mar -0.14 0.61 0.29
rho_ Intercept 1.36 0.41 9.98
age -0.05 0.02 7.25
fin -0.38 0.54 1.04
age:fin 0.02 0.02 1.74
---
Concordance = 0.62
AIC = 1383.65
log-likelihood ratio test = 29.60 on 8 df
-log2(p) of ll-ratio test = 11.97
"""
New KMunicate plots!
One of my favourite types of research articles are on changes to improving scientific understanding and communication. Last year, a paper, by T. Morris et al., came out that surveyed statisticians, clinicians and stakeholders on how to better communicate the workhorse of survival analysis, Kaplan-Meier (KM) curves. A number of potential options were shown to the over 1100 survey participants, and ratings to each option were made. The survey results show two changes that could be made to improve understanding of KM curves.
1. Always show confidence intervals around the curves (lifelines does this by default)
2. Present all the summary information at the bottom. Many KM curves would present at-risk numbers, and lifelines had this option as well:
But the participants really liked all summary information, including deaths and censorship, not just at-risk. lifelines now presents all this information:
It's not displayed by default (that may change), but with the at_risk_counts
kwarg in the call to KaplanMeierFitter.plot
.
This small change has the potential to be a massive improvement in understanding these plots. I'm so happy I saw this paper come across my desk.
Performance improvements
Performance improvements was actually part of a release a few weeks back and not the v0.25.0, but I wanted to highlight it anyways. I found a bug in the switching algorithm that we use for choosing which algorithm to run for the Cox model (see post on how that's done here). This bug made the choice of algorithm sub-optimal, specifically for very large datasets. After fixing the bug, CoxPHFitter
is able to process millions of rows in less than a tenth of a second (not to trash on it, but just for comparison: R takes on the order of seconds, and the core algorithm is written in C). This is probably one of the fastest Cox-Efron models, and it's written only in Python. This makes me really proud.
Conclusion
I'm really happy with where this release landed. There were lots of ups and downs in code quality and structure, but I feel like I settled on something nice. Formulas are a big deal, and will take lifelines to the next level. Future release will have support for partial-effect plots, and more.
You can see all the improvements, and important API changes, in v0.25.0 here.