Menu
Cart

Napkin Folding — spark

Searching through distributed datasets: The Mod-Binary Search

Posted by Cameron Davidson-Pilon at

On a not-too-unusual day, one of my Spark jobs failed in production. Typically this means there was a row of bad data that entered into the job. As I’m one to write regression tests, this “type” of bad had likely never been seen before, so I needed to inspect the individual offending row (or rows). Typically debug steps include: Manually inspecting all the recent data, either by hand or on a local machine. The failed job might print the offending...

Read more →

Joins in MapReduce Pt. 2 - Generalizing Joins in PySpark

Posted by Cameron Davidson-Pilon at

In the previous article in this series on Joins in MapReduce, we looked at how a traditional join is performed in a distributed map-reduce setting. I next want to generalize the idea of a join:

Read more →

Joins in MapReduce Pt. 1 - Implementations in PySpark

Posted by Cameron Davidson-Pilon at

In traditional databases, the JOIN algorithm has been exhaustively optimized: it's likely the bottleneck for most queries. On the other hand, MapReduce, being so primitive, has a simpler implementation. Let's look at a standard join in MapReduce (with syntax from PySpark).

Read more →