
Napkin Folding — pyspark

Percentile and Quantile Estimation of Big Data: The t-Digest

Posted by Cameron Davidson-Pilon at

Suppose you are interested in the sample average of an array. No problem, you think, as you write a small function to sum the elements and divide by the total count. Next, suppose you are interested in the sample average of a dataset that is spread across many computers. No problem, you think, as you write a function that returns the sum of the elements and the count of the elements, send this function to each computer, and divide the sum of...

Read more →
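
The distributed average described in the excerpt above can be sketched in a few lines. This is only an illustration in plain Python, not code from the post: the list of lists standing in for "many computers" and the names `partial_sum_count`, `combine`, and `distributed_mean` are assumptions made for the example.

```python
from functools import reduce

def partial_sum_count(partition):
    # Runs on each machine: return that machine's (sum, count).
    return (sum(partition), len(partition))

def combine(a, b):
    # Merge two (sum, count) pairs; associative, so the order of merging does not matter.
    return (a[0] + b[0], a[1] + b[1])

def distributed_mean(partitions):
    # Hypothetical driver: gather per-partition results, merge them, then divide once.
    total, count = reduce(combine, map(partial_sum_count, partitions), (0.0, 0))
    if count == 0:
        raise ValueError("cannot average an empty dataset")
    return total / count

# Each inner list plays the role of the data held by one computer.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
result = distributed_mean(partitions)
```

The point of the sketch is that a (sum, count) pair is small and mergeable, so each machine ships two numbers instead of its data. Percentiles have no such obvious mergeable summary, which is the gap the post's title says the t-Digest addresses.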

Joins in MapReduce Pt. 2 - Generalizing Joins in PySpark

Posted by Cameron Davidson-Pilon at

In the previous article in this series on Joins in MapReduce, we looked at how a traditional join is performed in a distributed map-reduce setting. I next want to generalize the idea of a join:

Read more →

Joins in MapReduce Pt. 1 - Implementations in PySpark

Posted by Cameron Davidson-Pilon at

In traditional databases, the JOIN algorithm has been exhaustively optimized, since it is likely the bottleneck for most queries. MapReduce, on the other hand, is so primitive that its join has a much simpler implementation. Let's look at a standard join in MapReduce (with syntax from PySpark).

Read more →
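
As a rough idea of what a standard join looks like in a map-reduce setting, here is a minimal sketch of a reduce-side inner join simulated in plain Python. It is not the post's implementation and does not use PySpark itself: the function names, the `"L"`/`"R"` tags, and the sample `users` and `orders` records are all assumptions for illustration. The output shape, `(key, (left_value, right_value))`, mirrors what PySpark's RDD `join` returns.

```python
from collections import defaultdict

def map_phase(records, tag):
    # Map: tag every (key, value) record with the dataset it came from.
    for key, value in records:
        yield key, (tag, value)

def shuffle(tagged_records):
    # Shuffle: bring all tagged values that share a key together.
    groups = defaultdict(list)
    for key, tagged_value in tagged_records:
        groups[key].append(tagged_value)
    return groups

def reduce_phase(groups):
    # Reduce: for each key, pair every left value with every right value.
    for key, tagged_values in groups.items():
        left = [v for tag, v in tagged_values if tag == "L"]
        right = [v for tag, v in tagged_values if tag == "R"]
        for l in left:
            for r in right:
                yield key, (l, r)

def inner_join(left_records, right_records):
    # Hypothetical driver chaining the three phases; sorted for stable output.
    tagged = list(map_phase(left_records, "L")) + list(map_phase(right_records, "R"))
    return sorted(reduce_phase(shuffle(tagged)))

users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (1, "pen"), (3, "lamp"), (4, "mug")]
joined = inner_join(users, orders)
```

Keys that appear on only one side (`2` and `4` here) produce no output, which is what makes this an inner join; emitting them with a placeholder instead would give the outer variants.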