Napkin Folding — python

Distribution of the last value in a sum of Uniforms that exceeds 1

Posted by Cameron Davidson-Pilon at

While working on a problem, I derived an interesting result about sums of uniform random variables. I wanted to record it here so I don't forget it (I haven't solved the more general problem yet!). Here's the summary of the result: Let \(S_n = \sum_{i=1}^n U_i \) be the sum of \(n\) independent Uniform(0, 1) random variables. Let \(N\) be the index of the first time the sum exceeds 1 (so \(S_{N-1} < 1\) and \(S_{N} \ge 1\)). The distribution of \(U_N\)...
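
The setup above is easy to simulate. Here is a minimal sketch, using only the standard library, that draws uniforms until the running sum reaches 1 and records both \(N\) and \(U_N\). The function names, trial count, and seed are my own choices for illustration, not taken from the full post. It only reports sample averages and makes no claim about the exact distribution of \(U_N\).

```python
import random


def last_uniform_before_exceeding(rng, threshold=1.0):
    """Draw Uniform(0, 1) values until the running sum reaches `threshold`.

    Returns (N, U_N): the index of the draw that pushed the sum to or past
    the threshold, and the value of that draw.
    """
    total = 0.0
    n = 0
    while True:
        u = rng.random()
        total += u
        n += 1
        if total >= threshold:
            return n, u


def simulate(trials=100_000, seed=0):
    """Return the sample means of N and U_N over `trials` repetitions."""
    rng = random.Random(seed)
    draws = [last_uniform_before_exceeding(rng) for _ in range(trials)]
    mean_n = sum(n for n, _ in draws) / trials
    mean_u = sum(u for _, u in draws) / trials
    return mean_n, mean_u


mean_n, mean_u = simulate()
print(f"average N:   {mean_n:.3f}")
print(f"average U_N: {mean_u:.3f}")
```

As a sanity check, the average of \(N\) should land near \(e\), which is a classical result for this stopping time. The average of \(U_N\) comes out above 0.5: the draw that crosses the threshold tends to be larger than a typical uniform.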

Poissonization of Multinomials

Posted by Cameron Davidson-Pilon at

Introduction I've seen some really interesting numerical solutions using a strategy called Poissonization, but Googling for it revealed very few resources (just some references in some textbooks that I don't have access to). So here it is: my notes and repository for Poissonization. Theorem: Let \(N \sim \text{Poi}(\lambda)\) and suppose that, conditional on \(N=n\), \((X_1, X_2, ..., X_k) \sim \text{Multi}(n, p_1, p_2, ..., p_k)\). Then, marginally, \(X_1, X_2, ..., X_k\) are independent Poisson, with \(X_i \sim \text{Poi}(p_i \lambda)\). [1] The proof is as follows. By...
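
The theorem can be checked numerically. Below is a rough sketch, standard library only, that samples \(N\) from a Poisson, splits it multinomially, and then compares the sample mean and variance of each count with \(p_i \lambda\). The helper names, \(\lambda = 6\), and the probabilities are assumptions chosen for illustration. A near-zero sample covariance is consistent with independence but does not prove it.

```python
import math
import random
from collections import Counter


def sample_poisson(rng, lam):
    """Sample Poisson(lam) with Knuth's multiplication method (fine for small lam)."""
    limit = math.exp(-lam)
    k = 0
    product = rng.random()
    while product > limit:
        k += 1
        product *= rng.random()
    return k


def sample_poissonized_counts(rng, lam, probs):
    """Draw N ~ Poi(lam), then split N items into cells with probabilities `probs`."""
    n = sample_poisson(rng, lam)
    cells = rng.choices(range(len(probs)), weights=probs, k=n)
    counts = Counter(cells)
    return [counts.get(i, 0) for i in range(len(probs))]


def summarize(trials=50_000, lam=6.0, probs=(0.5, 0.3, 0.2), seed=0):
    """Return per-cell sample means, sample variances, and Cov(X_1, X_2)."""
    rng = random.Random(seed)
    samples = [sample_poissonized_counts(rng, lam, probs) for _ in range(trials)]
    k = len(probs)
    means = [sum(s[i] for s in samples) / trials for i in range(k)]
    variances = [
        sum((s[i] - means[i]) ** 2 for s in samples) / trials for i in range(k)
    ]
    cov01 = sum(
        (s[0] - means[0]) * (s[1] - means[1]) for s in samples
    ) / trials
    return means, variances, cov01


means, variances, cov01 = summarize()
print("means:     ", [round(m, 3) for m in means])
print("variances: ", [round(v, 3) for v in variances])
print("cov(X1,X2):", round(cov01, 3))
```

With these settings, each mean and variance should sit close to \(p_i \lambda\) (3, 1.8, and 1.2), as expected for Poisson marginals. The covariance should be close to zero, unlike in a fixed-\(n\) multinomial where the counts are negatively correlated.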

"Reversing the Python Data Analysis Lens" Video

Posted by Cameron Davidson-Pilon at

Last November, I was lucky enough to give the keynote at PyCon Canada 2015. Below is the abstract and video for it: Python developers are commonly using Python as a tool to explore datasets - but what if we reverse that analysis lens back on to the developer? In this talk, Cam will use Python as a data analysis tool to explore Python developers and code. With millions of data points, mostly scraped from GitHub and Stack Overflow, we'll reexamine who...

Bayesian Methods for Hackers release!

Posted by Cameron Davidson-Pilon at

Finally, after a few years of writing and debugging, I'm proud to announce that the print copy of Bayesian Methods for Hackers is released! It has updated content, including a brand new chapter on A/B testing, compared to the online version. You can purchase it on Amazon today!

Bayesian M&M Problem in PyMC 2

Posted by Cameron Davidson-Pilon at

This Bayesian problem is from Allen Downey's Think Bayes book. I'll quote the problem here: M&M’s are small candy-coated chocolates that come in a variety of colors. Mars, Inc., which makes M&M’s, changes the mixture of colors from time to time. In 1995, they introduced blue M&M’s. Before then, the color mix in a bag of plain M&M’s was 30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan. Afterward it was 24% Blue, 20% Green, 16%...

Percentile and Quantile Estimation of Big Data: The t-Digest

Posted by Cameron Davidson-Pilon at

Suppose you are interested in the sample average of an array. No problem, you think, as you create a small function to sum the elements and divide by the total count. Next, suppose you are interested in the sample average of a dataset that exists on many computers. No problem, you think, as you create a function that returns the sum of the elements and the count of the elements, and send this function to each computer, and divide the sum of...
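
The distributed average described above can be sketched in a few lines. This is only an illustration: the "computers" are plain Python lists, and `partial_stats` and `combine` are hypothetical names, not functions from the full post.

```python
def partial_stats(chunk):
    """Run on each machine: return the (sum, count) of the local data."""
    return sum(chunk), len(chunk)


def combine(partials):
    """Run on the coordinator: merge the (sum, count) pairs into one mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Three "machines" holding unequal shares of the data.
machines = [[1.0, 2.0, 3.0], [10.0], [4.0, 6.0]]
partials = [partial_stats(chunk) for chunk in machines]
overall_mean = combine(partials)
print(overall_mean)
```

Shipping the (sum, count) pair rather than each machine's local mean matters when the shares are unequal. Averaging the three local means here would weight the single value 10.0 as heavily as the three values on the first machine.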

IPython Startup Scripts

Posted by Cameron Davidson-Pilon at

I've been playing around with my IPython workflow for the past few weeks, and have found one I really like. It uses IPython's startup files, which are run before the prompt opens up. This way I can load my favourite libraries, functions, etc., into my console. It also allows me to add my own %magic functions. Today, I've opened up my startup scripts in a GitHub repo, StartupFiles. The repo comes with some helper scripts too, to get you started: ./bin/build_symlink: for...

Joins in MapReduce Pt. 2 - Generalizing Joins in PySpark

Posted by Cameron Davidson-Pilon at

In the previous article in this series on Joins in MapReduce, we looked at how a traditional join is performed in a distributed map-reduce setting. I next want to generalize the idea of a join:

[Video] Presentation on Lifelines - Survival Analysis in Python, Sept. 23, 2014

Posted by Cameron Davidson-Pilon at

I gave this talk on Lifelines, my project on survival analysis in Python, to the Montreal Python Meetup. It's a pretty good introduction to survival analysis, and how to use Lifelines. Enjoy!

Joins in MapReduce Pt. 1 - Implementations in PySpark

Posted by Cameron Davidson-Pilon at

In traditional databases, the JOIN algorithm has been exhaustively optimized: it's likely the bottleneck for most queries. On the other hand, MapReduce, being so primitive, has a simpler implementation. Let's look at a standard join in MapReduce (with syntax from PySpark).
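
For a rough picture of what a standard join looks like in this setting, here is a sketch of a reduce-side inner join in plain Python. It is not the PySpark code from the full article. The phase functions, the "L"/"R" tags, and the sample data are all assumptions made for illustration. The output shape, `(key, (left_value, right_value))`, mirrors what PySpark's `RDD.join` returns for pair RDDs.

```python
from collections import defaultdict
from itertools import product


def map_phase(left, right):
    """Tag every (key, value) record with the dataset it came from."""
    for key, value in left:
        yield key, ("L", value)
    for key, value in right:
        yield key, ("R", value)


def shuffle(tagged):
    """Group tagged values by key, as the shuffle step would across machines."""
    groups = defaultdict(list)
    for key, tagged_value in tagged:
        groups[key].append(tagged_value)
    return groups


def reduce_phase(groups):
    """For each key, emit every pairing of a left value with a right value."""
    for key, tagged_values in groups.items():
        lefts = [v for tag, v in tagged_values if tag == "L"]
        rights = [v for tag, v in tagged_values if tag == "R"]
        for left_value, right_value in product(lefts, rights):
            yield key, (left_value, right_value)


def inner_join(left, right):
    """Chain the three phases; sorted only to make the output deterministic."""
    return sorted(reduce_phase(shuffle(map_phase(left, right))))


users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (1, "pen"), (3, "mug"), (4, "hat")]
joined = inner_join(users, orders)
print(joined)
```

Keys that appear on only one side (2 and 4 here) produce an empty cross product in the reduce step, so they drop out, which is what makes this an inner join.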

Exploring Human Psychology with Mechanical Turk Data

Posted by Cameron Davidson-Pilon at

This blog post is a little different: it's a whole data collection and data analysis story. I became interested in some theories from behavioural economics, and wanted to verify them. So I used Mechanical Turkers to gather data, and then did some exploratory data analysis in Python and Pandas (bonus: I recorded my data analysis and visualization, see below). Prospect Theory and Expected Values It's clear that humans are irrational, but how irrational are they? After some research into behavioural...

Using Census Data to Find Hot First Names

Posted by Cameron Davidson-Pilon at

We explore some cool data on first names and introduce a library for making this data available. We then use k-means to find the most trending names right now, and introduce some ideas on age inference of users. Freakonomics, the original Data Science book One of the first data science books, though it wasn't labelled that at the time, was the excellent book "Freakonomics" (2005). The authors were the first to publicise using data to solve large problems, or to...

Replicating 538's plot styles in Matplotlib

Posted by Cameron Davidson-Pilon at

Nate Silver's FiveThirtyEight site has some aesthetically pleasing figures, ignoring the content of the plots for a moment. After pulling a few graphs locally, sampling colors, and crowd-sourcing the fonts used, I was able to come pretty close to replicating the style in Matplotlib. Here's an example (my figure dropped into an article on FiveThirtyEight.com). Another example using the replicated styles: So how to do it? [Edit: these steps are old, you can still use them, but there is...
