# Napkin Folding - Data Origami's Blog

## Highlights from lifelines v0.25.0

Posted by Cameron Davidson-Pilon at

Today, lifelines v0.25.0 was released. I'm very excited about some changes in this version, and want to highlight a few of them. Be sure to upgrade with: `pip install lifelines==0.25.0`

### Formulas everywhere!

Formulas, which should really be called Wilkinson-style notation but everyone just calls them formulas, are a lightweight grammar for describing additive relationships. If you have used R, you'll likely be familiar with formulas. They are less common in Python, so here's an example: Writing age +...
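To make the idea of an additive formula concrete, here is a toy sketch of what a string like `"age + prio"` describes: pick the named columns and line them up as a design matrix. This is not how lifelines parses formulas, and it handles only plain `+` terms. The function name, column names, and values are all invented for illustration.

```python
def additive_design_matrix(formula, rows):
    """Expand a purely additive formula such as "a + b" into column names
    and one list of numeric values per row. Toy parser: '+' terms only."""
    terms = [term.strip() for term in formula.split("+")]
    matrix = [[float(row[term]) for term in terms] for row in rows]
    return terms, matrix


# Hypothetical subjects; the column names and numbers are made up.
subjects = [
    {"age": 27, "prio": 3, "week": 20},
    {"age": 18, "prio": 8, "week": 17},
]

# Only the columns named in the formula end up in the design matrix.
columns, design = additive_design_matrix("age + prio", subjects)
print(columns)
print(design)
```

A real formula grammar also covers interactions, transformations, and categorical encoding, none of which this sketch attempts.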

## An L½ penalty in Cox Regression

Posted by Cameron Davidson-Pilon at

## Exploring Human Psychology with Mechanical Turk Data

Posted by Cameron Davidson-Pilon at

This blog post is a little different: it's a whole data collection and data analysis story. I became interested in some theories from behavioural economics, and wanted to verify them. So I used Mechanical Turkers to gather data, and then did some exploratory data analysis in Python and Pandas (bonus: I recorded my data analysis and visualization, see below).

### Prospect Theory and Expected Values

It's clear that humans are irrational, but how irrational are they? After some research into behavioural...

## Using Census Data to Find Hot First Names

Posted by Cameron Davidson-Pilon at

We explore some cool data on first names and introduce a library for making this data available. We then use k-means to find the most trending names right now, and introduce some ideas on age inference of users.

### Freakonomics, the original Data Science book

One of the first data science books, though it wasn't labelled that at the time, was the excellent book "Freakonomics" (2005). The authors were the first to publicise using data to solve large problems, or to...

## 8 great data blogs to follow

Posted by Cameron Davidson-Pilon at

Below I've listed my favourite data analysis, data science, or otherwise technical blogs that I've learned a great deal from. Big +1's to the blogs' authors for providing all these ideas and intellectual property for public access. The list is in no particular order - and it's only blogs I remember, so if your blog isn't here, I may have just forgotten it ;)

1. Andrew Gelman's Statistical Modeling, Causal Inference, and Social Science

   Gelman is probably the leader in...

## Replicating 538's plot styles in Matplotlib

Posted by Cameron Davidson-Pilon at

Nate Silver's FiveThirtyEight site has some aesthetically pleasing figures, ignoring the content of the plots for a moment.

After pulling a few graphs locally, sampling colors, and crowd-sourcing the fonts used, I was able to come pretty close to replicating the style in Matplotlib. Here's an example (my figure dropped into an article on FiveThirtyEight.com), and another example using the replicated styles.

So how to do it?

[Edit: these steps are old, you can still use them, but there is...

## The Binary Problem and The Continuous Problem in A/B testing

Posted by Cameron Davidson-Pilon at

### Introduction

I feel like there is a misconception in performing A/B tests. I've seen blogs, articles, etc. that show off the result of an A/B test, something like "converted X% better". But this is not what the A/B test was actually measuring: an A/B test is measuring "which group is better" (the binary problem), not "how much better" (the continuous problem). In practice, here's what happens: the tester waits until the A/B test is over (hence solving the binary problem),...
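The gap between the two problems can be seen in a small simulation. The sketch below is one possible Bayesian treatment, not necessarily the approach the post goes on to take: it assumes uniform Beta(1, 1) priors on each conversion rate, and the conversion counts are invented. It reports the probability that B beats A (the binary problem) alongside a 95% interval for the relative lift (the continuous problem).

```python
import random


def ab_posterior_summary(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Sample Beta posteriors (uniform priors assumed) for two conversion rates.

    Returns P(rate_B > rate_A) and a central 95% interval for the
    relative lift (rate_B - rate_A) / rate_A.
    """
    rng = random.Random(seed)
    wins = 0
    lifts = []
    for _ in range(draws):
        p_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        p_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += p_b > p_a
        lifts.append((p_b - p_a) / p_a)
    lifts.sort()
    lower = lifts[int(0.025 * draws)]
    upper = lifts[int(0.975 * draws) - 1]
    return wins / draws, (lower, upper)


# Hypothetical experiment: 100/1000 conversions for A, 130/1000 for B.
prob_b_better, lift_interval = ab_posterior_summary(100, 1000, 130, 1000)
print(f"P(B better than A) = {prob_b_better:.3f}")
print(f"95% interval for relative lift = ({lift_interval[0]:.2f}, {lift_interval[1]:.2f})")
```

With counts like these, the answer to "which group is better" is fairly confident while the interval for "how much better" stays wide, so quoting a single "X% better" figure overstates what the test established.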

## Data's Use in the 21st Century

Posted by Cameron Davidson-Pilon at

The technological challenges, and achievements, of the 20th Century handed society powerful tools: technologies like nuclear power, airplanes and automobiles, the digital computer, radio, the internet and imaging technologies, to name only a handful. Each of these technologies disrupted the system, and each can be argued to be a Black Swan (à la Nassim Taleb). In fact, for each technology, one could find a company killed by it, and a company that made its billions from it. What these technologies have...

## Feature Space in Machine Learning

Posted by Cameron Davidson-Pilon at

Feature space refers to the $$n$$ dimensions where your variables live (not including a target variable, if it is present). The term is used often in ML literature because a task in ML is feature extraction, hence we view all variables as features. For example, consider the data set with:

Target

- $$Y \equiv$$ Thickness of car tires after some testing period

Variables

- $$X_1 \equiv$$ distance travelled in test
- $$X_2 \equiv$$ time duration of test
- $$X_3 \equiv$$ amount of chemical $$C$$ in...
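As a rough illustration of the definition, the sketch below stores a few observations and separates the features from the target. It uses only the three variables visible in the excerpt, and every number is invented purely for illustration.

```python
# Invented measurements: X1 = distance, X2 = duration, X3 = amount of chemical C,
# Y = tire thickness (the target). None of these values come from real data.
observations = [
    {"X1": 1200.0, "X2": 10.0, "X3": 0.5, "Y": 7.9},
    {"X1": 3400.0, "X2": 31.0, "X3": 0.7, "Y": 6.8},
    {"X1": 5100.0, "X2": 44.0, "X3": 0.2, "Y": 5.1},
]

feature_names = ["X1", "X2", "X3"]

# Each row of X is one point in feature space; the target is kept apart.
X = [[obs[name] for name in feature_names] for obs in observations]
y = [obs["Y"] for obs in observations]

n_dimensions = len(feature_names)
print(n_dimensions, X[0], y[0])
```

Here the feature space is three-dimensional: each test run is a point $$(X_1, X_2, X_3)$$, and $$Y$$ is not one of its coordinates.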

## Generating exponential survival data

Posted by Cameron Davidson-Pilon at

Suppose we are interested in generating exponential survival times with scale parameter $$\lambda$$, and having $$\alpha$$ probability of censoring, $$0 \le \alpha < 1$$. This is actually, at least from what I tried, a non-trivial problem. I've derived a few algorithms:

Algorithm 1

1. Generate $$T \sim \text{Exp}( \lambda )$$. If $$\alpha = 0$$, return $$(T, 1)$$.
2. Solve $$\frac{ \lambda h }{ \exp (\lambda h) -1 } = \alpha$$ for $$h$$.
3. Generate $$E \sim \text{TruncExp}( \lambda, h )$$, where $$\text{TruncExp}$$ is...
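The steps that are visible above can be sketched as follows. Several things here are assumptions, because the excerpt is cut off before the algorithm finishes: $$\lambda$$ is treated as a rate (which is how it enters the equation for $$h$$), $$\text{TruncExp}(\lambda, h)$$ is taken to be an exponential truncated to $$[0, h]$$, and the censoring time is taken to be $$C = h - E$$. That last choice is one completion for which $$P(T > C) = \frac{\lambda h}{\exp(\lambda h) - 1} = \alpha$$, and it may not be the author's actual final step. The function names are hypothetical.

```python
import math
import random


def solve_h(lam, alpha):
    """Solve lam*h / (exp(lam*h) - 1) = alpha for h by bisection (0 < alpha < 1)."""
    lo, hi = 1e-12, 1.0
    # x / expm1(x) decreases from 1 towards 0, so grow hi until it brackets the root.
    while hi / math.expm1(hi) > alpha:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid / math.expm1(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi) / lam


def trunc_exp(lam, h, rng):
    """Inverse-CDF draw from an exponential (rate lam) truncated to [0, h]."""
    u = rng.random()
    return -math.log1p(u * math.expm1(-lam * h)) / lam


def exponential_survival_data(lam, alpha, n, seed=0):
    """Return n pairs (observed_time, event_flag); event_flag is 0 when censored."""
    rng = random.Random(seed)
    h = solve_h(lam, alpha) if alpha > 0 else None
    data = []
    for _ in range(n):
        t = rng.expovariate(lam)  # lam is ASSUMED to be a rate here
        if h is None:
            data.append((t, 1))
            continue
        c = h - trunc_exp(lam, h, rng)  # ASSUMED censoring time, not from the source
        data.append((t, 1) if t <= c else (c, 0))
    return data


draws = exponential_survival_data(lam=2.0, alpha=0.3, n=20000, seed=42)
censored_fraction = sum(1 for _, flag in draws if flag == 0) / len(draws)
print(f"censored fraction = {censored_fraction:.3f}")
```

Under these assumptions the empirical censored fraction should land near the requested $$\alpha$$, which gives a quick sanity check on the equation for $$h$$.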

## Multi-Armed Bandits

Posted by Cameron Davidson-Pilon at

Preface: This example is a (greatly modified) excerpt from the open-source book Bayesian Methods for Hackers, currently being developed on GitHub. Adapted from an example by Ted Dunning of MapR Technologies.

### The Multi-Armed Bandit Problem

Suppose you are faced with $$N$$ slot machines (colourfully called multi-armed bandits). Each bandit has an unknown probability of distributing a prize (assume for now the prizes are the same for each bandit, only the probabilities differ). Some bandits are very generous, others not so...
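The setup above can be simulated in a few lines. The excerpt stops before any strategy is described, so the sketch below uses Thompson sampling with Beta posteriors, a common Bayesian approach to this problem, as an assumption rather than as the post's own method. The bandit probabilities are invented, and in a real setting they would be unknown to the player.

```python
import random


def thompson_sampling(true_probs, pulls=5000, seed=0):
    """Play a Bernoulli multi-armed bandit with Thompson sampling.

    Each arm keeps a Beta(1 + wins, 1 + losses) posterior (uniform prior
    assumed). On every round we draw one sample per arm and pull the arm
    with the largest sample.
    """
    rng = random.Random(seed)
    n_arms = len(true_probs)
    wins = [0] * n_arms
    trials = [0] * n_arms
    for _ in range(pulls):
        samples = [
            rng.betavariate(1 + wins[i], 1 + trials[i] - wins[i])
            for i in range(n_arms)
        ]
        arm = max(range(n_arms), key=samples.__getitem__)
        reward = rng.random() < true_probs[arm]  # hidden prize probability
        trials[arm] += 1
        wins[arm] += reward
    return wins, trials


# Hypothetical prize probabilities for three bandits.
wins, trials = thompson_sampling([0.15, 0.2, 0.6])
print("pulls per bandit:", trials)
```

Because the sampled values for poorly performing arms are rarely the largest, most pulls drift towards the most generous bandit while the others still get occasional exploration.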