
Napkin Folding — data science

Non-parametric survival function prediction

Posted by Cameron Davidson-Pilon at

As I was developing lifelines, I kept having a feeling that I was gradually moving the library towards prediction tasks. lifelines is great for regression models and fitting survival distributions, but as I was adding more and more flexible parametric models, I realized that I really wanted a model that would predict the survival function — and I didn't care how. This led me to the idea of using a neural net with \(n\) outputs, one output for each parameter...
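To make the idea concrete, here is a minimal sketch, not lifelines' actual implementation and with hypothetical names like predict_weibull_params: a tiny, untrained network maps covariates to the two Weibull parameters \(\lambda\) and \(\rho\), and the predicted survival function is \(S(t) = \exp(-(t/\lambda)^\rho)\). In practice the weights would be fit by maximizing the censored log-likelihood.

import numpy as np

rng = np.random.default_rng(0)

def softplus(z):
    # maps any real-valued network output to a positive parameter value
    return np.log1p(np.exp(z))

def predict_weibull_params(x, W1, b1, W2, b2):
    # one hidden layer, two outputs: one per Weibull parameter
    h = np.tanh(x @ W1 + b1)
    out = h @ W2 + b2
    return softplus(out[..., 0]), softplus(out[..., 1])

def predicted_survival(t, x, params):
    lam, rho = predict_weibull_params(x, *params)
    return np.exp(-(t / lam) ** rho)

# toy usage: 3 covariates, random (untrained) placeholder weights
params = (rng.normal(size=(3, 8)), np.zeros(8), rng.normal(size=(8, 2)), np.zeros(2))
x = rng.normal(size=3)
print(predicted_survival(t=5.0, x=x, params=params))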

Read more →

SaaS churn and piecewise regression survival models

Posted by Cameron Davidson-Pilon at

A software-as-a-service (SaaS) company has a typical customer churn pattern. During periods of no billing, churn is relatively low compared to billing periods (typically every 30 or 365 days). This results in a distinct survival function for customers. See below:

kmf = KaplanMeierFitter().fit(df['T'], df['E'])
kmf.plot(figsize=(11,6));

To borrow a term from finance, we clearly have different regimes that a customer goes through: periods of low churn and periods of high churn, both of which are predictable. This predictability and...
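One way to model those regimes directly is a piecewise model with breakpoints at the billing boundaries. Here is a minimal sketch using lifelines' PiecewiseExponentialFitter on synthetic stand-in data rather than the post's df; the breakpoint locations are assumptions.

import numpy as np
from lifelines import PiecewiseExponentialFitter

rng = np.random.default_rng(1)
T = rng.exponential(scale=45, size=1000) + 1   # stand-in durations, in days
E = rng.binomial(1, 0.8, size=1000)            # stand-in churn indicator (1 = churned)

# breakpoints placed at the assumed 30-day billing boundaries
pf = PiecewiseExponentialFitter(breakpoints=[30, 60, 90]).fit(T, E)
pf.print_summary()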

Read more →

Evolution of lifelines over the past few months

Posted by Cameron Davidson-Pilon at

TLDR: upgrade lifelines for lots of improvements: pip install lifelines==0.22.1

During my time off, I’ve spent a lot of time improving my side projects so I’m at least kinda proud of them. I think lifelines, my survival analysis library, is in that spot. I’m actually kinda proud of it now. A lot has changed in lifelines in the past few months, and in this post I want to mention some of the biggest additions and the stories behind them. Performance...

Read more →

A real-life mistake I made about penalizer terms

Posted by Cameron Davidson-Pilon at

I made a very interesting mistake, and I wanted to share it with you because it's quite enlightening about statistical learning in general. It concerns a penalizer term in maximum-likelihood estimation. Normally, one deals only with the penalizer coefficient, that is, one plays around with \(\lambda\) in an MLE optimization like: $$ \min_{\theta} -\ell(\theta) + \lambda ||\theta||_p^p $$ where \(\ell\) is the log-likelihood and \(||\cdot||_p\) is the \(p\)-norm. This family of problems is typically solved by calculus because both...
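For a concrete picture of the setup (not the example the post goes on to discuss), here is a sketch of penalized MLE with an \(L_2\) penalty on a simple exponential model, fit to synthetic data with scipy:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
data = rng.exponential(scale=2.0, size=200)    # true rate is 0.5

def neg_penalized_loglik(theta, t, lam):
    rate = np.exp(theta[0])                    # parameterize on the log scale to keep the rate positive
    log_lik = np.sum(np.log(rate) - rate * t)  # exponential log-likelihood
    return -log_lik + lam * np.sum(theta**2)   # add the L2 penalty, scaled by the coefficient lambda

for lam in [0.0, 1.0, 10.0]:
    res = minimize(neg_penalized_loglik, x0=np.array([0.0]), args=(data, lam))
    # larger lambda shrinks theta toward 0, i.e. the rate estimate toward exp(0) = 1
    print(lam, np.exp(res.x[0]))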

Read more →

Poissonization of Multinomials

Posted by Cameron Davidson-Pilon at

Introduction

I've seen some really interesting probability & numerical solutions using a strategy called Poissonization, but Googling for it revealed very few resources (just some references in some textbooks that I don't have quick access to). Below are my notes and repository for Poissonization. After we introduce the theory, we'll do some examples. The technique relies on the following theorem:

Theorem: Let \(N \sim \text{Poi}(\lambda)\) and suppose that, given \(N = n\), \((X_1, X_2, \ldots, X_k) \sim \text{Multi}(n, p_1, p_2, \ldots, p_k)\). Then, marginally, \(X_1, X_2, \ldots, X_k\)...
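The standard conclusion of that theorem is that each \(X_i\) is marginally \(\text{Poi}(\lambda p_i)\), independently of the others, and it is easy to check by simulation. The following is just a quick sanity check, not part of the post:

import numpy as np

rng = np.random.default_rng(3)
lam, p = 10.0, np.array([0.5, 0.3, 0.2])

# draw N ~ Poisson(lambda), then (X_1, ..., X_k) | N = n ~ Multinomial(n, p)
N = rng.poisson(lam, size=50_000)
X = np.array([rng.multinomial(n, p) for n in N])

print(X.mean(axis=0), lam * p)        # each mean is close to lambda * p_i
print(X.var(axis=0), lam * p)         # Poisson behaviour: variance roughly equals the mean
print(np.corrcoef(X.T).round(3))      # off-diagonal correlations are near zero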

Read more →

Bayesian M&M Problem in PyMC 2

Posted by Cameron Davidson-Pilon at

This Bayesian problem is from Allen Downey's Think Bayes book. I'll quote the problem here: M&M’s are small candy-coated chocolates that come in a variety of colors. Mars, Inc., which makes M&M’s, changes the mixture of colors from time to time. In 1995, they introduced blue M&M’s. Before then, the color mix in a bag of plain M&M’s was 30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan. Afterward it was 24% Blue, 20% Green, 16%...
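For reference, the question this problem builds to in Think Bayes (one M&M drawn from a 1994-mix bag and one from a 1996-mix bag; one is yellow and one is green; what is the probability the yellow one came from the 1994 bag?) can be checked back-of-the-envelope in plain Python. This is not the post's PyMC 2 model, just the arithmetic:

mix94 = {'brown': 0.30, 'yellow': 0.20, 'red': 0.20, 'green': 0.10, 'orange': 0.10, 'tan': 0.10}
mix96 = {'blue': 0.24, 'green': 0.20, 'orange': 0.16, 'yellow': 0.14, 'red': 0.13, 'brown': 0.13}

# Hypothesis A: the yellow came from the 1994 bag (so the green came from the 1996 bag).
# Hypothesis B: the yellow came from the 1996 bag (so the green came from the 1994 bag).
like_A = mix94['yellow'] * mix96['green']
like_B = mix96['yellow'] * mix94['green']

# equal priors on the two hypotheses, so the posterior is just the normalized likelihoods
posterior_A = like_A / (like_A + like_B)
print(posterior_A)   # 20/27, about 0.74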

Read more →

Percentile and Quantile Estimation of Big Data: The t-Digest

Posted by Cameron Davidson-Pilon at

Suppose you are interested in the sample average of an array. No problem, you think, as you create a small function to sum the elements and divide by the total count. Next, suppose you are interested in the sample average of a dataset that exists on many computers. No problem, you think, as you create a function that returns the sum of the elements and the count of the elements, send this function to each computer, and divide the sum of...
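The distributed-mean pattern described above only needs a (sum, count) pair from each machine. A minimal sketch, with in-memory arrays standing in for the data on each computer:

import numpy as np

rng = np.random.default_rng(4)
shards = [rng.normal(size=n) for n in (1000, 2500, 400)]     # data living on 3 "computers"

partials = [(shard.sum(), shard.size) for shard in shards]   # what each machine sends back
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)

print(total_sum / total_count)         # distributed mean
print(np.concatenate(shards).mean())   # matches the centralized mean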

Read more →

Dawkins on Saying "statistically, ... "

Posted by Cameron Davidson-Pilon at

Richard Dawkins, in his early book The Extended Phenotype, describes what he means when he says "statistically, X occurs". His original motivation was addressing a comment about gender, but it applies more generally:  If, then, it were true that the possession of a Y chromosome had a causal influence on, say, musical ability or fondness for knitting, what would this mean? It would mean that, in some specified population and in some specified environment, an observer in possession of information...

Read more →

[Video] Presentation on Lifelines - Survival Analysis in Python, Sept. 23, 2014

Posted by Cameron Davidson-Pilon at

I gave this talk on Lifelines, my project on survival analysis in Python, to the Montreal Python Meetup. It's a pretty good introduction to survival analysis, and how to use Lifelines. Enjoy!

Read more →

Using Census Data to Find Hot First Names

Posted by Cameron Davidson-Pilon at

We explore some cool data on first names and introduce a library for making this data available. We then use k-means to find the most trending names right now, and introduce some ideas on age inference of users.

Freakonomics, the original Data Science book

One of the first data science books, though it wasn't labelled that at the time, was the excellent book "Freakonomics" (2005). The authors were the first to publicise using data to solve large problems, or to...
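As a rough sketch of the k-means step mentioned above (on synthetic frequency curves, not the census data or the library the post introduces): each row is one name's normalized yearly frequency curve, and clustering the curves groups names with similar trajectories, so "hot" names should fall in the cluster whose centroid rises toward recent years.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
years = np.arange(1960, 2015)

rising = np.cumsum(rng.uniform(0, 1, size=(30, years.size)), axis=1)   # trending-up curves
falling = rising[:, ::-1]                                              # trending-down curves
curves = np.vstack([rising, falling])
curves = curves / curves.max(axis=1, keepdims=True)                    # normalize each name's curve

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(curves)
print(labels[:30], labels[30:])   # the two synthetic groups should land in different clusters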

Read more →

8 great data blogs to follow

Posted by Cameron Davidson-Pilon at

Below I've listed my favourite data analysis, data science, or otherwise technical blogs that I've learned a great deal from. Big +1's to the blogs' authors for providing all these ideas and intellectual property for public access. The list is in no particular order - and it's only blogs I remember, so if your blog isn't here, I may have just forgotten it ;)

1. Andrew Gelman's Statistical Modeling, Causal Inference, and Social Science

Gelman is probably the leader in...

Read more →

Data's Use in the 21st Century

Posted by Cameron Davidson-Pilon at

The technological challenges, and achievements, of the 20th Century handed society powerful tools: nuclear power, airplanes & automobiles, the digital computer, radio, the internet, and imaging technologies, to name only a handful. Each of these technologies disrupted the system, and each can be argued to be a Black Swan (à la Nassim Taleb). In fact, for each technology one could find a company killed by it, and a company that made its billions from it. What these technologies have...

Read more →