# Napkin Folding — penalties

## An L½ penalty in Cox Regression

Posted by **Cameron Davidson-Pilon** at

Following up on a previous blog post where we explored how to implement \(L_1\) and elastic net penalties to induce sparsity, a paper by Xu Z B, Zhang H, Wang Y, et al. explores what an \(L_{1/2}\) penalty is and how to implement it. But first: we are familiar with the \(L_1\) penalty, but what is an \(L_0\) penalty? If you work out the math, it is a penalty that counts the number of non-zero coefficients, independent of their magnitude: $$ll^*(\theta, x) =...
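To make the contrast between these penalties concrete, here is a minimal sketch in plain Python. It assumes the \(L_q\) penalty is the sum of \(|\theta_i|^q\) (so \(q = 1\) gives \(L_1\) and \(q = 1/2\) gives \(L_{1/2}\)) and that the \(L_0\) penalty is a count of non-zero coefficients; the function names and the toy coefficient vectors are hypothetical and are not taken from the post or the paper.

```python
def l0_penalty(coefs, tol=0.0):
    """Count the coefficients whose magnitude exceeds tol (the L0 'norm')."""
    return sum(1 for c in coefs if abs(c) > tol)


def lq_penalty(coefs, q):
    """Sum of |c|**q over the coefficients; q=1 is L1, q=0.5 is L1/2."""
    return sum(abs(c) ** q for c in coefs)


# Two hypothetical coefficient vectors with the same sparsity pattern
# but very different magnitudes.
small = [0.0, 0.01, -0.02, 0.0]
large = [0.0, 10.0, -20.0, 0.0]

for name, coefs in [("small", small), ("large", large)]:
    print(
        name,
        "L0:", l0_penalty(coefs),
        "L1/2:", round(lq_penalty(coefs, 0.5), 4),
        "L1:", round(lq_penalty(coefs, 1.0), 4),
    )
```

Both vectors have two non-zero entries, so the \(L_0\) penalty treats them identically, while the \(L_{1/2}\) and \(L_1\) penalties grow with the size of the coefficients.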

## L₁ Penalty in Cox Regression

Posted by **Cameron Davidson-Pilon** at

In the 2000s, L1 penalties were all the rage in statistics and machine learning. Because they induce sparsity in fitted parameters, they were used as a variable selection method. Today, with some advanced models having tens of billions of parameters, sparsity isn't as useful, and the L1 penalty has dropped out of fashion. However, most teams aren't using billion-parameter models, and smart data scientists work with simple models initially. Below is how we implemented an L1 penalty in the...

## SaaS churn and piecewise regression survival models

Posted by **Cameron Davidson-Pilon** at

A software-as-a-service (SaaS) company has a typical customer churn pattern. During periods of no billing, the churn is relatively low compared to periods of billing (typically every 30 or 365 days). This results in a distinct survival function for customers. See below: `kmf = KaplanMeierFitter().fit(df['T'], df['E'])` `kmf.plot(figsize=(11,6));` To borrow a term from finance, we clearly have different regimes that a customer goes through: periods of low churn and periods of high churn, both of which are predictable. This predictability and...
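The `KaplanMeierFitter` call in the excerpt comes from a library (presumably lifelines) and needs the post's `df`, which is not shown here. As a rough, dependency-free sketch of what that estimator computes, the snippet below implements the Kaplan-Meier product-limit formula by hand. The function name `kaplan_meier` and the toy durations are hypothetical; they only mimic churn clustering around billing dates.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time with at least one observed event.

    durations: time until churn or censoring for each customer.
    observed:  1 if churn was observed at that time, 0 if censored.
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        # Customers still being followed just before time t.
        at_risk = sum(1 for d in durations if d >= t)
        # Customers whose churn was observed exactly at time t.
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1.0 - events / at_risk
        curve.append((t, survival))
    return curve


# Hypothetical toy data: days until churn (E=1) or censoring (E=0).
T = [30, 30, 45, 60, 60, 60, 90, 365]
E = [1, 1, 0, 1, 1, 0, 0, 1]

curve = kaplan_meier(T, E)
for t, s in curve:
    print(f"day {t}: estimated survival {s:.3f}")
```

In this toy data the estimated survival curve is flat between billing dates and steps down only at days 30, 60, and 365, which is the regime pattern the paragraph describes.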

## A real-life mistake I made about penalizer terms

Posted by **Cameron Davidson-Pilon** at

I made a very interesting mistake, and I wanted to share it with you because it's quite instructive for statistical learning in general. It concerns a penalizer term in maximum-likelihood estimation. Normally, one deals only with the penalizer coefficient, that is, one plays around with \(\lambda\) in an MLE optimization like: $$ \min_{\theta} -\ell(\theta) + \lambda ||\theta||_p^p $$ where \(\ell\) is the log-likelihood and \(||\cdot||_p\) is the \(p\)-norm. This family of problems is typically solved by calculus because both...
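As a small worked instance of the penalized objective above, consider a case chosen purely for illustration (it is not the model from the post): estimating the mean \(\theta\) of normal data with unit variance, with \(p = 2\). Up to a constant, \(-\ell(\theta) = \tfrac{1}{2}\sum_i (x_i - \theta)^2\), and setting the derivative of \(-\ell(\theta) + \lambda\theta^2\) to zero gives \(\hat\theta = \sum_i x_i / (n + 2\lambda)\). The sketch below checks that closed form numerically; all names and data are hypothetical.

```python
def objective(theta, xs, lam):
    """Negative log-likelihood (up to a constant) plus an L2 penalty, p = 2."""
    return 0.5 * sum((x - theta) ** 2 for x in xs) + lam * theta ** 2


def penalized_mean(xs, lam):
    """Closed-form minimizer of `objective`: sum(x) / (n + 2 * lam)."""
    return sum(xs) / (len(xs) + 2.0 * lam)


xs = [1.0, 2.0, 3.0, 4.0]  # hypothetical observations

for lam in (0.0, 1.0, 10.0):
    theta_hat = penalized_mean(xs, lam)
    # The closed-form solution should beat nearby values of theta.
    assert objective(theta_hat, xs, lam) <= objective(theta_hat + 0.01, xs, lam)
    assert objective(theta_hat, xs, lam) <= objective(theta_hat - 0.01, xs, lam)
    print(f"lambda={lam}: theta_hat={theta_hat:.4f}")
```

With \(\lambda = 0\) the estimate is the sample mean, and increasing \(\lambda\) shrinks it toward zero, which is the usual effect of playing with the penalizer coefficient.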