
Napkin Folding — penalties

An L½ penalty in Cox Regression

Posted by Cameron Davidson-Pilon

Following up on a previous blog post, where we explored how to implement \(L_1\) and elastic net penalties to induce sparsity, a paper by Xu Z B, Zhang H, Wang Y, et al. explores what an \(L_{1/2}\) penalty is and how to implement it. But first: we are familiar with an \(L_1\) penalty, but what, then, is an \(L_0\) penalty? If you work out the math, it is a penalty that counts the number of non-zero coefficients, independent of their magnitudes: $$ll^*(\theta, x) =...
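For reference, the \(L_0\) "norm" such a penalty uses simply counts the non-zero entries,
$$ \|\theta\|_0 = \sum_{j} \mathbf{1}[\theta_j \neq 0], $$
so adding \(\lambda \|\theta\|_0\) to the negative log-likelihood charges a flat \(\lambda\) for each non-zero coefficient, regardless of its size. (This is a sketch of the idea only; the full post spells out the exact objective.)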

Read more →

L₁ Penalty in Cox Regression

Posted by Cameron Davidson-Pilon

In the 2000s, \(L_1\) penalties were all the rage in statistics and machine learning. Since they induce sparsity in the fitted parameters, they were used as a variable selection method. Today, with some advanced models having tens of billions of parameters, sparsity isn't as useful, and the \(L_1\) penalty has dropped out of fashion. However, most teams aren't using billion-parameter models, and smart data scientists work with simple models initially. Below is how we implemented an \(L_1\) penalty in the...
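The excerpt cuts off before the implementation details, but as a minimal sketch of the same idea (assuming lifelines, which the excerpt does not name, and not necessarily how the post does it), CoxPHFitter's penalizer and l1_ratio arguments give a pure \(L_1\) penalty when l1_ratio=1.0:

    from lifelines import CoxPHFitter
    from lifelines.datasets import load_rossi

    # Rossi recidivism data ships with lifelines: 'week' is the duration, 'arrest' the event flag.
    df = load_rossi()

    # penalizer sets the strength (lambda); l1_ratio=1.0 makes the penalty pure L1,
    # which shrinks some coefficients to exactly zero (i.e. variable selection).
    cph = CoxPHFitter(penalizer=0.1, l1_ratio=1.0)
    cph.fit(df, duration_col="week", event_col="arrest")
    cph.print_summary()

With a pure \(L_1\) penalty, increasing penalizer drives more coefficients to exactly zero, which is the variable-selection behaviour the excerpt refers to.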

Read more →

SaaS churn and piecewise regression survival models

Posted by Cameron Davidson-Pilon

A software-as-a-service (SaaS) company has a typical customer churn pattern. During periods of no billing, churn is relatively low compared to periods of billing (typically every 30 or 365 days). This results in a distinct survival function for customers. See below:

    from lifelines import KaplanMeierFitter  # the excerpt omits this import

    kmf = KaplanMeierFitter().fit(df['T'], df['E'])
    kmf.plot(figsize=(11, 6))

To borrow a term from finance, we clearly have different regimes that a customer goes through: periods of low churn and periods of high churn, both of which are predictable. This predictability and...
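The full post goes on to model these regimes; as a rough sketch of the piecewise idea (the breakpoints here are made up for illustration, and the post's actual model may differ), lifelines' PiecewiseExponentialFitter fits a separate constant hazard between breakpoints, which you could place at the billing boundaries:

    from lifelines import PiecewiseExponentialFitter

    # Illustrative breakpoints at 30-day billing boundaries.
    pwf = PiecewiseExponentialFitter(breakpoints=[30, 60, 90])
    pwf.fit(df['T'], df['E'])
    pwf.plot(figsize=(11, 6))

Each segment then captures one regime: low hazard mid-cycle, higher hazard around each billing event.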

Read more →

A real-life mistake I made about penalizer terms

Posted by Cameron Davidson-Pilon

I made a very interesting mistake, and I wanted to share it with you because it's quite enlightening about statistical learning in general. It concerns a penalizer term in maximum-likelihood estimation. Normally, one deals only with the penalizer coefficient; that is, one plays around with \(\lambda\) in an MLE optimization like: $$ \min_{\theta} -\ell(\theta) + \lambda ||\theta||_p^p $$ where \(\ell\) is the log-likelihood and \(||\cdot||_p\) is the \(p\)-norm. This family of problems is typically solved by calculus because both...
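To make that objective concrete, here is a toy penalized MLE (my own example with an arbitrarily chosen \(\lambda\) and \(p = 2\), not one from the post), solved numerically:

    import numpy as np
    from scipy.optimize import minimize

    # Toy data: estimate the mean of a normal sample with known unit variance.
    rng = np.random.default_rng(0)
    x = rng.normal(loc=2.0, scale=1.0, size=100)

    lam, p = 1.0, 2  # penalizer coefficient and norm, chosen for illustration only

    def objective(theta):
        # negative log-likelihood (up to an additive constant) plus lambda * |theta|^p
        neg_log_lik = 0.5 * np.sum((x - theta[0]) ** 2)
        return neg_log_lik + lam * np.abs(theta[0]) ** p

    theta_hat = minimize(objective, x0=[0.0]).x[0]
    print(theta_hat)  # slightly shrunk toward 0 relative to the sample mean

The penalty pulls the estimate toward zero; how much depends on \(\lambda\) and on the choice of \(p\), which is exactly the knob the post is about.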

Read more →