A real-life mistake I made about penalizer terms

I made a very interesting mistake, and I want to share it because it says something about statistical learning in general. It concerns a penalizer term in maximum-likelihood estimation. Normally, one deals only with the penalizer coefficient, that is, one plays around with $\lambda$ in an MLE optimization like:

$$ \min_{\theta} -\ell(\theta) + \lambda ||\theta||_p^p $$

where $\ell$ is the log-likelihood and $||\cdot||_p$ is the $p$-norm. This family of problems is typically solved by calculus because both terms are easy to differentiate (when $p$ is an integer greater than 1). Actually, this is backwards: we want to solve the MLE using calculus (because that's our hammer), and in order to add a penalizer term, we need it to be differentiable too. That is why we historically used the 2-norm.

If we don't solve the optimization problem with calculus, well then we can be more flexible with our penalizer term. I took this liberty when I developed the optimizations in lifetimes, my Python library for recency/frequency analysis. The MLE in the model is too complicated to differentiate, so I use numerical methods to find the minimum (Nelder-Mead to be exact). Because of this, I am free to add any penalizer term I wish, deviating as I choose from the traditional 2-norm. I made the choice of $\log$, specifically:

$$ \min_{\alpha, \beta} -\ell(\alpha, \beta) + \lambda \left( \log(\alpha) + \log(\beta) \right) $$

First: Why is this a good idea?

My unknown parameters are strictly positive, so I don't need to worry about taking the log of a non-positive number.
My parameters are on different scales. This is really important: the two parameters describe different phenomena, and one typically comes out an order of magnitude larger than the other. A 2-norm penalizer term would scale the larger of the two down more than the smaller of the two. (Why? Because the square of a number increases faster the larger the number is, so larger numbers increase the overall penalizer term more.) If I choose the $\log$ penalizer term instead, this does not happen. For example, if I double either the smaller term or the larger term, the effect on the overall penalizer term is the same. So $\log$ works better when I have parameters on different scales.
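To make the doubling argument concrete, here is a minimal sketch that compares how each penalizer term changes when one parameter is doubled. The parameter values and function names are made up for illustration; they are not taken from lifetimes.

```python
import math


def log_penalty(params):
    """Sum of logs; only defined for strictly positive parameters."""
    return sum(math.log(p) for p in params)


def squared_penalty(params):
    """Squared 2-norm of the parameters."""
    return sum(p ** 2 for p in params)


# Hypothetical values: one parameter an order of magnitude larger than the other.
small, large = 0.5, 5.0


def change_when_doubled(penalty):
    """Return (change from doubling `small`, change from doubling `large`)."""
    base = penalty((small, large))
    return (
        penalty((2 * small, large)) - base,
        penalty((small, 2 * large)) - base,
    )


log_small, log_large = change_when_doubled(log_penalty)
sq_small, sq_large = change_when_doubled(squared_penalty)

print(f"log penalty:     {log_small:.3f} vs {log_large:.3f}")
print(f"squared penalty: {sq_small:.3f} vs {sq_large:.3f}")
```

With these values, doubling either parameter changes the $\log$ penalizer by the same amount, $\log(2)$, while the squared penalizer changes by 0.75 for the small parameter and 75 for the large one.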

This was my logic when I first developed lifetimes. Things were going well, until I started noticing that some datasets would produce unstable convergence only when the penalizer coefficient $\lambda$ was positive: it was driving some variables to nearly 0. How could this be? Shouldn't any positive penalizer coefficient help convergence? To answer this, we'll take two perspectives on the problem.

Logs of small values

I probably don't need to say it, but the log of a value less than 1 is negative. More extreme, the log of a very small value is very very negative (because the rate of change of log near zero gets larger as we approach the asymptote). Thus, during optimization, when a parameter starts to get small, shrinking it further lowers the penalizer term faster and faster. In fact, the optimizer keeps shrinking that parameter toward 0, because the penalizer term is unbounded below there and that really really helps the overall objective.

This is obviously not what I wanted. Sure, I wanted to keep values from being too large, but I certainly did not want to reduce parameters to near zero! So it made sense that when I had $\lambda$ equal to 0 I did not observe this behaviour.
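The runaway behaviour can be reproduced with a toy objective. This is only a sketch: the quadratic below is a hypothetical stand-in for a negative log-likelihood, not the actual lifetimes model, and the function names are invented here.

```python
import math


def toy_neg_log_likelihood(theta):
    # Hypothetical stand-in for -l(theta): a bowl with its minimum at theta = 2.
    return (theta - 2.0) ** 2


def penalized(theta, lam):
    # Objective with the log penalizer; theta must be strictly positive.
    return toy_neg_log_likelihood(theta) + lam * math.log(theta)


# Evaluate at the unpenalized optimum and at ever smaller parameter values.
thetas = [2.0, 1e-3, 1e-30, 1e-300]
for lam in (0.0, 1.0):
    values = [penalized(t, lam) for t in thetas]
    print(f"lambda={lam}: {[round(v, 2) for v in values]}")
```

With $\lambda = 0$ the objective is smallest at $\theta = 2$, as expected. With $\lambda = 1$ the objective keeps decreasing as $\theta$ approaches 0, so an optimizer that wanders into that region has no reason to stop.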

On the other extreme, the $\log$ penalizer is kinda a terrible penalizer against large values too. An order of magnitude increase in a parameter adds only $\log(10) \approx 2.3$ to the log of it! It's an extremely sub-linear function, so it doesn't really penalize large parameter sizes well.

Bayesian perspective on penalizer terms

As noted in a chapter in Bayesian Methods for Hackers, there is a really beautiful and useful relationship between MLE penalizer terms and Bayesian priors. Simply put, the prior is proportional to the exponential of the negative penalizer term. Thus, the 2-norm penalizer term is a Normal prior on the unknowns, the 1-norm penalizer term is a Laplace prior, and so on. What prior, then, does our $\log$ penalizer correspond to? Well, taking $\lambda = 1$, it's a $\exp(-\log(\theta)) = \frac{1}{\theta}$ prior on $\theta$. This is strange, no? It's an improper prior, and it's in fact the Jeffreys prior for a scale parameter, so I'm basically saying "I have no idea what this scale parameter should be" - not what I want to be doing.
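To spell out that correspondence, here is a short sketch. The symbols $\pi$ (the prior density) and $P$ (a generic penalizer) are notation introduced here for illustration. The maximum a posteriori estimate is

$$ \hat{\theta}_{MAP} = \arg\max_{\theta} \; \ell(\theta) + \log \pi(\theta) = \arg\min_{\theta} \; -\ell(\theta) - \log \pi(\theta) $$

Matching this against the penalized objective $-\ell(\theta) + \lambda P(\theta)$ gives, up to a normalizing constant,

$$ \pi(\theta) \propto \exp(-\lambda P(\theta)) $$

so that

$$ P(\theta) = \theta^2 \Rightarrow \pi(\theta) \propto e^{-\lambda \theta^2}, \qquad P(\theta) = |\theta| \Rightarrow \pi(\theta) \propto e^{-\lambda |\theta|}, \qquad P(\theta) = \log(\theta) \Rightarrow \pi(\theta) \propto \theta^{-\lambda} $$

The first two are the Normal and Laplace kernels. The last one does not integrate to a finite value over $(0, \infty)$ for any $\lambda$, which is what makes it improper.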

Conclusion

As much as I like the scale-free property of the $\log$, it's time to say goodbye to it in favor of another. I think for my purposes, I'll try the Laplace prior/1-norm penalizer as it's a nice balance between disallowing extremely large values and not penalizing the largest parameter too strongly.

Update: I actually went with the square penalizer in the end. Why? The 1-norm was still sending too many values to zero, when I really felt strongly that no value should be zero.