
Napkin Folding - Data Origami's Blog

Searching through distributed datasets: The Mod-Binary Search

Posted by Cameron Davidson-Pilon

On a not-too-unusual day, one of my Spark jobs failed in production. Typically this means a row of bad data entered the job. Since I write regression tests, this "type" of bad data had likely never been seen before, so I needed to inspect the individual offending row (or rows). Typical debug steps include: manually inspecting all the recent data, either by hand or on a local machine. The failed job might print the offending...
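The full post develops a "mod-binary search" over distributed partitions; as a rough single-machine sketch of the core bisection idea (my illustration, assuming exactly one bad row and a hypothetical `job_fails` predicate that reruns the job on a subset):

```python
def find_bad_row(rows, job_fails):
    # bisect the dataset: if the job fails on the left half, the bad row is there;
    # otherwise it must be in the right half. Assumes exactly one bad row and a
    # hypothetical job_fails(subset) predicate that reruns the job on that subset.
    lo, hi = 0, len(rows)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if job_fails(rows[lo:mid]):
            hi = mid
        else:
            lo = mid
    return rows[lo]

# toy example: row 63 is the "bad" one
bad = find_bad_row(list(range(100)), lambda subset: 63 in subset)  # → 63
```

Each probe reruns the job on half the remaining data, so the offending row is found in \(O(\log n)\) job runs rather than a full manual inspection.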

Read more →

A real-life mistake I made about penalizer terms

Posted by Cameron Davidson-Pilon

I made a very interesting mistake, and I wanted to share it with you because it's quite enlightening about statistical learning in general. It concerns the penalizer term in maximum-likelihood estimation. Normally, one deals only with the penalizer coefficient, that is, one plays around with \(\lambda\) in an MLE optimization like: $$ \min_{\theta} -\ell(\theta) + \lambda ||\theta||_p^p $$ where \(\ell\) is the log-likelihood and \(||\cdot||_p\) is the \(p\)-norm. This family of problems is typically solved by calculus because both...
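To make the setup concrete, here's a minimal sketch (my addition, not from the post) of the optimization above for a Gaussian mean with \(p = 2\), using `scipy.optimize`; note how the penalized estimate is shrunk toward zero relative to the plain MLE:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=200)

def penalized_nll(theta, lam, p=2):
    # negative Gaussian log-likelihood (known unit variance) plus lam * |theta|^p
    t = np.asarray(theta).ravel()[0]
    return 0.5 * np.sum((data - t) ** 2) + lam * abs(t) ** p

mle = minimize(penalized_nll, x0=0.0, args=(0.0,)).x[0]      # lam = 0: plain MLE, the sample mean
shrunk = minimize(penalized_nll, x0=0.0, args=(50.0,)).x[0]  # lam = 50: estimate pulled toward 0
```

With \(p = 2\) the penalized solution has a closed form, \(\hat\theta = \sum x_i / (n + 2\lambda)\), which the numerical optimizer recovers.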

Read more →

Distribution of the last value in a sum of Uniforms that exceeds 1

Posted by Cameron Davidson-Pilon

While working on a problem, I derived an interesting result about sums of uniform random variables. I wanted to record it here so I don't forget it (I haven't solved the more general problem yet!). Here's the summary of the result: let \(S_n = \sum_{i=1}^n U_i \) be the sum of \(n\) Uniform random variables, and let \(N\) be the index of the first time the sum exceeds 1 (so \(S_{N-1} < 1\) and \(S_{N} \ge 1\)). The distribution of \(U_N\)...
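The excerpt cuts off before the result itself; as a companion sanity check (my addition, not the post's derivation), here's a quick simulation of \(N\), whose expected value is the classic \(E[N] = e\):

```python
import math
import random

random.seed(42)

def sample_N_and_UN():
    # add Uniform(0, 1) draws until the running sum exceeds 1;
    # return N (how many draws it took) and U_N (the final draw)
    s, n = 0.0, 0
    while s < 1.0:
        u = random.random()
        s += u
        n += 1
    return n, u

trials = 200_000
ns = [sample_N_and_UN()[0] for _ in range(trials)]
avg_n = sum(ns) / trials  # should be close to e ≈ 2.71828
```

The same simulation loop also hands you samples of \(U_N\), so an empirical histogram can be checked against whatever closed form the post derives.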

Read more →

Poissonization of Multinomials

Posted by Cameron Davidson-Pilon

Introduction I've seen some really interesting numerical solutions using a strategy called Poissonization, but Googling for it revealed very few resources (just some references in textbooks that I don't have access to). So here it is: my notes and repository for Poissonization.  Theorem: Let \(N \sim \text{Poi}(\lambda)\) and suppose that, given \(N = n\), \((X_1, X_2, \ldots, X_k) \sim \text{Multi}(n, p_1, p_2, \ldots, p_k)\). Then, marginally, \(X_1, X_2, \ldots, X_k\) are independent Poisson, with \(X_i \sim \text{Poi}(p_i \lambda)\). [1]  The proof is as follows. By...
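A quick numerical check of the theorem (my sketch, not from the post): simulate the compound draw and verify each marginal \(X_i\) has the mean and variance of a \(\text{Poi}(p_i \lambda)\) variable:

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 10.0
p = np.array([0.5, 0.3, 0.2])

# simulate the compound draw: N ~ Poi(lam), then (X_1, ..., X_k) ~ Multi(N, p)
N = rng.poisson(lam, size=100_000)
X = np.array([rng.multinomial(n, p) for n in N])

# marginally each X_i should be Poi(p_i * lam): mean and variance both equal p_i * lam
means = X.mean(axis=0)
variances = X.var(axis=0)
```

Matching means alone wouldn't distinguish a Poisson from, say, a binomial marginal; the mean-equals-variance check is what points at Poissonness.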

Read more →

"Reversing the Python Data Analysis Lens" Video

Posted by Cameron Davidson-Pilon

Last November, I was lucky enough to give the keynote at PyCon Canada 2015. Below is the abstract and video for it:  Python developers are commonly using Python as a tool to explore datasets - but what if we reverse that analysis lens back on to the developer? In this talk, Cam will use Python as a data analysis tool to explore Python developers and code. With millions of data points, mostly scraped from Github and Stackoverflow, we'll reexamine who...

Read more →

Bayesian Methods for Hackers release!

Posted by Cameron Davidson-Pilon

Finally, after a few years of writing and debugging, I'm proud to announce that the print copy of Bayesian Methods for Hackers is released! It has updated content, including a brand new chapter on A/B testing, compared to the online version. You can purchase it on Amazon today!

Read more →

[Video] Mistakes I've Made talk at PyData 2015

Posted by Cameron Davidson-Pilon

A presentation from PyData Seattle 2015 about all the mistakes I've made in data analysis and data science

Read more →

How can I use non-constructive proofs in data analysis?

Posted by Cameron Davidson-Pilon

In mathematics, there are two classes of proof techniques: constructive and non-constructive. A constructive proof demonstrates how to build the required object; its construction proves its existence, hence you are done. An example of this is proving that the prime numbers are infinite using Euclid's argument: to find a new prime, you multiply together all the prime numbers seen thus far and add 1. On the other hand, a non-constructive proof does not detail how to build the object; it just states that it must...
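Euclid's constructive step is easy to state in code (a toy illustration, my addition): the product of the known primes plus 1 shares no factor with any of them, so it yields a prime outside the list, either itself or one of its prime factors.

```python
from math import prod

def euclid_candidate(primes):
    # product of all known primes, plus one: not divisible by any prime in the list
    return prod(primes) + 1

candidate = euclid_candidate([2, 3, 5, 7])  # 2*3*5*7 + 1 = 211, itself prime
```

In this case the candidate happens to be prime itself; in general (e.g. starting from `[2, 3, 5, 7, 11, 13]`, whose product plus one is 30031 = 59 × 509) one of its prime factors supplies the new prime instead.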

Read more →

Bayesian M&M Problem in PyMC 2

Posted by Cameron Davidson-Pilon

This Bayesian problem is from Allen Downey's Think Bayes book. I'll quote the problem here:  M&M’s are small candy-coated chocolates that come in a variety of colors. Mars, Inc., which makes M&M’s, changes the mixture of colors from time to time. In 1995, they introduced blue M&M’s. Before then, the color mix in a bag of plain M&M’s was 30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan. Afterward it was 24% Blue, 20% Green, 16%...
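The excerpt truncates the color mixes, but assuming the standard setup from Think Bayes (one M&M drawn from a 1994 bag and one from a 1996 bag; they turn out to be yellow and green; what's the probability the yellow one came from the 1994 bag?), the Bayes-table arithmetic is tiny. The mix numbers below come from the book, not the excerpt:

```python
# hypotheses, given that the two draws turned up yellow and green:
#   H1: the yellow M&M came from the 1994 bag, the green from the 1996 bag
#   H2: the yellow came from the 1996 bag, the green from the 1994 bag
# color mixes (from Think Bayes; not shown in full in the excerpt):
yellow94, green94 = 0.20, 0.10
yellow96, green96 = 0.14, 0.20

likelihood_h1 = yellow94 * green96
likelihood_h2 = yellow96 * green94

# equal priors, so the posterior is just the normalized likelihood
posterior_h1 = likelihood_h1 / (likelihood_h1 + likelihood_h2)  # 20/27 ≈ 0.741
```

The post itself does this with PyMC 2; the hand computation above is just the ground truth the sampler should recover.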

Read more →

Evolutionary Group Theory - or what happens when algebraic groups have sex.

Posted by Cameron Davidson-Pilon

And now for something totally different. This is not data related. It's a paper I wrote about an intersection between group theory and evolutionary dynamics: basically, what happens when groups have sex. Interested? Read on! TL;DR: you can find analogous group theory axioms in dynamical systems. Population Dynamics of Algebraic Groups We construct a dynamical population whose individuals are assigned elements from an algebraic group \(G\) and subject them to sexual reproduction. We investigate the relationship between the dynamical...

Read more →

Percentile and Quantile Estimation of Big Data: The t-Digest

Posted by Cameron Davidson-Pilon

Suppose you are interested in the sample average of an array. No problem you think, as you create a small function to sum the elements and divide by the total count. Next, suppose you are interested in the sample average of a dataset that exists on many computers. No problem you think, as you create a function that returns the sum of the elements and the count of the elements, and send this function to each computer, and divide the sum of...
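The (sum, count) trick from the excerpt, sketched in plain Python (a toy stand-in for the many-computers setting):

```python
def local_summary(chunk):
    # each machine returns (sum, count), not its local average
    return (sum(chunk), len(chunk))

def combined_mean(summaries):
    # averages combine correctly only via the pooled sum and pooled count;
    # naively averaging the per-machine averages would weight machines unequally
    total = sum(s for s, _ in summaries)
    count = sum(c for _, c in summaries)
    return total / count

machines = [[1.0, 2.0, 3.0], [10.0], [4.0, 5.0]]  # toy stand-ins for data on three computers
mean = combined_mean([local_summary(m) for m in machines])
```

The mean has this easy mergeable summary; the point of the post is that percentiles and quantiles do not, which is exactly the gap the t-digest fills.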

Read more →

Lifetimes: Measuring Customer Lifetime Value in Python

Posted by Cameron Davidson-Pilon

Lifetimes is my latest Python project. Below is a summary, but you can also check out the source code on Github. Introduction As emphasized by P. Fader and B. Hardie, understanding and acting on customer lifetime value (CLV) is the most important part of your business's sales efforts. And (apparently) everyone is doing it wrong. Lifetimes is a Python library to calculate CLV for you. More generally, Lifetimes can be used to understand and predict future usage based on a...

Read more →

What The Name?!

Posted by Cameron Davidson-Pilon

Over the holidays, Kylea Parker and I put together our first-ever infographic! With her design background and my stats background, we set out to do infographics right: correct statistics and beautiful communication through design. I believe we achieved that. The data analysis was done using demographica.

Read more →

IPython Startup Scripts

Posted by Cameron Davidson-Pilon

I've been playing around with my IPython workflow for the past few weeks, and have found one I really like. It uses IPython's startup files, which are run before the prompt opens. This way I can load my favourite libraries, functions, etc., into my console. It also allows me to add my own %magic functions.  Today, I've opened up my startup scripts in a github repo, StartupFiles. The repo comes with some helper scripts too, to get you started:  ./bin/build_symlink: for...

Read more →

Dawkins on Saying "statistically, ... "

Posted by Cameron Davidson-Pilon

Richard Dawkins, in his early book The Extended Phenotype, describes what he means when he says "statistically, X occurs". His original motivation was addressing a comment about gender, but it applies more generally:  If, then, it were true that the possession of a Y chromosome had a causal influence on, say, musical ability or fondness for knitting, what would this mean? It would mean that, in some specified population and in some specified environment, an observer in possession of information...

Read more →

Joins in MapReduce Pt. 2 - Generalizing Joins in PySpark

Posted by Cameron Davidson-Pilon

In the previous article in this series on Joins in MapReduce, we looked at how a traditional join is performed in a distributed map-reduce setting. I next want to generalize the idea of a join:

Read more →

[Video] Presentation on Lifelines - Survival Analysis in Python, Sept. 23, 2014

Posted by Cameron Davidson-Pilon

I gave this talk on Lifelines, my project on survival analysis in Python, to the Montreal Python Meetup. It's a pretty good introduction to survival analysis, and how to use Lifelines. Enjoy!

Read more →

Joins in MapReduce Pt. 1 - Implementations in PySpark

Posted by Cameron Davidson-Pilon

In traditional databases, the JOIN algorithm has been exhaustively optimized: it's likely the bottleneck for most queries. On the other hand, MapReduce, being so primitive, has a simpler implementation. Let's look at a standard join in MapReduce (with syntax from PySpark).
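PySpark syntax aside, the core of a map-reduce (reduce-side) join can be sketched in plain Python; this is a simplified illustration of the idea, not the post's actual code:

```python
from collections import defaultdict

left = [("a", 1), ("b", 2)]
right = [("a", "x"), ("a", "y"), ("c", "z")]

# map phase: tag each record with its source table, keyed for the shuffle
tagged = [(k, ("L", v)) for k, v in left] + [(k, ("R", v)) for k, v in right]

# shuffle: group all tagged records by key
groups = defaultdict(list)
for k, tv in tagged:
    groups[k].append(tv)

# reduce phase: per key, emit the cross product of left and right values (inner join)
joined = []
for k, tvs in groups.items():
    lefts = [v for t, v in tvs if t == "L"]
    rights = [v for t, v in tvs if t == "R"]
    joined.extend((k, lv, rv) for lv in lefts for rv in rights)
```

Keys appearing in only one table ("b" and "c" above) produce nothing, which is exactly inner-join semantics; outer joins differ only in how the reduce phase handles an empty side.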

Read more →

Why Your Distribution Might be Long-Tailed

Posted by Cameron Davidson-Pilon

I really like the video below explaining how a long-tailed distribution (also called a power-law or fat-tailed distribution) can form naturally. In fact, I keep thinking about it and applying it to some statistical thinking. Long-tailed distributions are incredibly common in the social sciences: for example, we encounter them in wealth distributions (few people control most of the wealth), social networks (celebrities have thousands of times more followers than the median user), and revenue generated by businesses (Amazon is larger than...

Read more →

The Class Imbalance Problem in A/B Testing

Posted by Cameron Davidson-Pilon

Introduction If you have been following this blog, you'll know that I employ Bayesian A/B testing for conversion tests (and see this screencast to see how it works). One of the strongest reasons for this is the interpretability of the analogous "p-value", which I call Confidence, defined as the probability the conversion rate of A is greater than B, $$\text{confidence} = P( C_A > C_B ) $$ Really, this is what the experimenter wants - an answer to: "what are...
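The Confidence quantity \(P(C_A > C_B)\) is straightforward to estimate by Monte Carlo with Beta posteriors; a minimal sketch with made-up conversion counts:

```python
import numpy as np

rng = np.random.default_rng(7)

# hypothetical observed data: conversions out of visitors for each group
conv_a, n_a = 120, 1000
conv_b, n_b = 100, 1000

# flat Beta(1, 1) priors give Beta posteriors over each conversion rate;
# Monte Carlo estimate of P(C_A > C_B) from posterior samples
samples_a = rng.beta(1 + conv_a, 1 + n_a - conv_a, size=100_000)
samples_b = rng.beta(1 + conv_b, 1 + n_b - conv_b, size=100_000)
confidence = (samples_a > samples_b).mean()
```

Unlike a p-value, the resulting number is directly the probability the experimenter asked about.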

Read more →

Exploring Human Psychology with Mechanical Turk Data

Posted by Cameron Davidson-Pilon

This blog post is a little different: it's a whole data collection and data analysis story. I became interested in some theories from behavioural economics, and wanted to verify them. So I used Mechanical Turkers to gather data, and then did some exploratory data analysis in Python and Pandas (bonus: I recorded my data analysis and visualization, see below). Prospect Theory and Expected Values It's clear that humans are irrational, but how irrational are they? After some research into behavioural...

Read more →

Using Census Data to Find Hot First Names

Posted by Cameron Davidson-Pilon

We explore some cool data on first names and introduce a library for making this data available. We then use k-means to find the most trending names right now, and introduce some ideas on age inference of users. Freakonomics, the original Data Science book One of the first data science books, though it wasn't labelled that at the time, was the excellent book "Freakonomics" (2005). The authors were the first to publicise using data to solve large problems, or to...

Read more →

8 great data blogs to follow

Posted by Cameron Davidson-Pilon

Below I've listed my favourite data analysis, data science, or otherwise technical blogs that I've learned a great deal from. Big +1's to the blogs' authors for providing all these ideas and intellectual property for public access. The list is in no particular order - and it's only blogs I remember, so if your blog isn't here, I may have just forgotten it ;) 1. Andrew Gelman's Statistical Modeling, Causal Inference, and Social Science Gelman is probably the leader in...

Read more →

Replicating 538's plot styles in Matplotlib

Posted by Cameron Davidson-Pilon

Nate Silver's FiveThirtyEight site has some aesthetically pleasing figures, ignoring the content of the plots for a moment. After pulling a few graphs locally, sampling colors, and crowd-sourcing the fonts used, I was able to come pretty close to replicating the style in Matplotlib. Here's an example (my figure dropped into an article on FiveThirtyEight.com). Another example using the replicated styles: So how to do it? [Edit: these steps are old, you can still use them, but there is...

Read more →

The Binary Problem and The Continuous Problem in A/B testing

Posted by Cameron Davidson-Pilon

Introduction I feel like there is a misconception in performing A/B tests. I've seen blogs, articles, etc. that show off the result of an A/B test, something like "converted X% better". But this is not what the A/B test was actually measuring: an A/B test is measuring "which group is better" (the binary problem), not "how much better" (the continuous problem). In practice, here's what happens: the tester waits until the A/B test is over (hence solving the binary problem),...

Read more →

Data's Use in the 21st Century

Posted by Cameron Davidson-Pilon

The technological challenges, and achievements, of the 20th century handed society powerful tools: nuclear power, airplanes & automobiles, the digital computer, radio, the internet and imaging technologies, to name only a handful. Each of these technologies has disrupted the system, and each can be argued to be a Black Swan (à la Nassim Taleb). In fact, for each technology, one could find a company killed by it, and a company that made its billions from it. What these technologies have...

Read more →

Feature Space in Machine Learning

Posted by Cameron Davidson-Pilon

Feature space refers to the \(n\)-dimensional space where your variables live (not including a target variable, if present). The term is used often in the ML literature because a common task in ML is feature extraction, hence we view all variables as features. For example, consider the data set with: Target \(Y \equiv\) thickness of car tires after some testing period. Variables: \(X_1 \equiv\) distance travelled in the test, \(X_2 \equiv\) time duration of the test, \(X_3 \equiv\) amount of chemical \(C\) in...

Read more →

Multi-Armed Bandits

Posted by Cameron Davidson-Pilon

Preface: This example is a (greatly modified) excerpt from the open-source book Bayesian Methods for Hackers, currently being developed on Github. Adapted from an example by Ted Dunning of MapR Technologies. The Multi-Armed Bandit Problem Suppose you are faced with \(N\) slot machines (colourfully called multi-armed bandits). Each bandit has an unknown probability of distributing a prize (assume for now the prizes are the same for each bandit; only the probabilities differ). Some bandits are very generous, others not so...
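One standard Bayesian solution (a sketch in the spirit of the chapter, not its exact code) is Thompson sampling: keep a Beta posterior per bandit, draw one sample from each, and pull the arm whose sample is largest. With hypothetical payout probabilities:

```python
import numpy as np

rng = np.random.default_rng(3)
true_probs = np.array([0.15, 0.30, 0.60])  # hidden payout probabilities (hypothetical)
wins = np.zeros(3)
losses = np.zeros(3)

for _ in range(2000):
    # sample one draw from each arm's Beta posterior, pull the arm with the largest draw
    draws = rng.beta(1 + wins, 1 + losses)
    arm = int(np.argmax(draws))
    reward = rng.random() < true_probs[arm]
    wins[arm] += reward
    losses[arm] += 1 - reward

pulls = wins + losses
best_arm = int(np.argmax(pulls))  # the generous arm should dominate the pull counts
```

Early on the wide posteriors force exploration; as evidence accumulates, the posterior for the best arm tightens and the strategy exploits it almost exclusively.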

Read more →

An algorithm to sort "Top" comments

Posted by Cameron Davidson-Pilon

(originally posted on camdp.com) Preface: This example is a (greatly modified) excerpt from the open-source book Bayesian Methods for Hackers, currently being developed on Github ;) Why is sorting from "best" to "worst" so difficult? Consider ratings on online products: how often do you trust an average 5-star rating if there is only 1 reviewer? 2 reviewers? 3 reviewers? We implicitly understand that with so few reviewers, the average rating is not a good reflection of the true value...
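One common fix, in the spirit of the post (this particular snippet is my sketch, not the post's algorithm), is to sort not by the mean rating but by a pessimistic lower bound on it, e.g. the 5th percentile of a Beta posterior over the item's true upvote ratio:

```python
from scipy.stats import beta

def lower_bound_score(ups, downs, q=0.05):
    # rank by the 5th percentile of the Beta(ups + 1, downs + 1) posterior:
    # a pessimistic estimate of the item's true upvote ratio
    return beta.ppf(q, ups + 1, downs + 1)

few = lower_bound_score(1, 0)      # one perfect review: still heavily discounted
many = lower_bound_score(100, 10)  # many mostly-positive reviews: ranks higher
```

A single 5-star review gets a wide posterior and hence a low lower bound, so it ranks below an item with a slightly worse ratio but far more evidence, which matches our intuition about trust.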

Read more →

Interior designing with Machine Learning

Posted by Cameron Davidson-Pilon

(Originally on camdp.com) While designing my new apartment, I found a very cool use of machine learning. Yes, that's right, you can use machine learning in interior design. As crazy as it sounds, it is completely legitimate. The Problem I like to show off my collection of books. And with nice new bookshelves in my apartment, I wanted to display the book collection in an interesting way. I've seen organizing books by colour and gradient, a visual trick I find...

Read more →

Least Squares Regression with L1 Penalty

Posted by Cameron Davidson-Pilon

I'd like to discuss and demonstrate a really cool idea in machine learning and statistics. It's a simple idea: adding a constraint to an optimization problem, specifically a constraint on the sum of learned parameters, can have huge impacts on your interpretation, robustness and sanity. I must first introduce the family of functions we will be discussing. The L-norm family The family of L-norm penalty functions, \(L_p: \mathbb{R}^d \rightarrow \mathbb{R}\), is defined: $$ L_p( x ) = || x ||_p =...
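To preview where this goes for \(p = 1\): least squares with an \(L_1\) penalty (the lasso) drives some coefficients exactly to zero. A self-contained coordinate-descent sketch (my illustration, on synthetic data, not the post's code):

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso(X, y, lam, n_iters=200):
    # minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent
    n, d = X.shape
    b = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0)
    for _ in range(n_iters):
        for j in range(d):
            # residual with feature j's own contribution added back
            r = y - X @ b + X[:, j] * b[j]
            b[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_b = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ true_b + 0.1 * rng.normal(size=100)

b_hat = lasso(X, y, lam=5.0)  # the three irrelevant coefficients are driven to (or near) zero
```

The soft-thresholding step is where the sparsity comes from: any coordinate whose correlation with the residual falls below \(\lambda\) is set exactly to zero, which an \(L_2\) penalty never does.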

Read more →