8 great data blogs to follow
Below I've listed my favourite data analysis, data science, or otherwise technical blogs that I've learned a great deal from. Big +1s to the blogs' authors for making all these ideas freely available to the public. The list is in no particular order - and it only includes blogs I remember, so if your blog isn't here, I may have just forgotten it ;)
1. Andrew Gelman's Statistical Modeling, Causal Inference, and Social Science
Gelman is probably the leader in modern Bayesian inference - he's a co-author of Bayesian Data Analysis, a book so popular that it can be referred to by its initials, BDA, and everyone knows what you are talking about. His blog is very active (and has an associated Twitter account too) and he has great discussions on modelling, exposing bad analysis, MCMC, and statistical inference. One time he mentioned Bayesian Methods for Hackers and my heart melted.
Selected articles by Gelman
2. Simply Statistics
Jeff Leek and co. are doing a great job with Simply Statistics. The blog is less technical than Gelman's, and focuses more on where statistics fits in with science, big data and data science. Recently, they have embarked on an amazing project: replicating the data analysis in Piketty's "Capital in the Twenty-First Century".
Selected articles from Simply Statistics
- R.A. Fisher is the most influential scientist ever
- Prediction: the Lasso vs. just using the top 10 predictors
- 10 things statistics taught us about big data analysis
3. Evan Miller's Blog, evanmiller.org
Miller's blog articles, no matter how old, keep appearing on popular news sites, and it's well deserved. I still recall how excited I got reading "How Not To Run an A/B Test" for the first time. Recently, he's been playing around with Bayesian A/B testing and survival analysis too, so clearly he is awesome.
Selected articles from evanmiller.org
4. Rasmus Bååth's Research Blog, sumsar.net
Rasmus blew my mind with his terrific articles on Bayesian testing (below). His writing style is very clean with lots of custom graphics - you can tell he takes his time writing his articles. Rasmus is also the author of Bayesian First Aid, a Bayesian testing framework for R.
Selected articles from sumsar.net
5. Jake Vanderplas' Pythonic Perambulations
Vanderplas, who is likely a robot and doesn't sleep, has made great contributions to the Python ecosystem: he's the author of mpld3, a translation of matplotlib figures to D3 for IPython notebooks, the amazing xkcd matplotlib styles, and he's been a leader in teaching Python data analysis through conferences and lectures. His blog is an extension of his work: amazing tutorials, projects, and all very readable. It's really, really difficult to pick only a sample to present:
Selected articles from Pythonic Perambulations
- The Big Data brain drain
- Frequentism and Bayesianism: a practical introduction
- Dynamic Programming in Python
6. Allen Downey's Probably Overthinking It
Downey, probably the most prolific writer on this list, is the author of the "Think [Stats, Python, Bayes, Complexity]" series. His blog is often his sketch pad before the book, and is full of fun articles. When learning survival analysis myself, I kept going back to his article (below) just to reinforce the application.
Selected articles from Probably Overthinking It
7. Abraham Flaxman's Healthy Algorithms
Without Flaxman's blog, I would probably not have understood Bayesian computation ideas. During my 2-day seclusion to grok Bayesian methods, and later while I was developing my tools, I constantly read and reread his articles on PyMC. His blog is still very active, and the research he produces on it (and yes, it is research) is terrific.
Selected articles from Healthy Algorithms
8. Yhat's blog
The team at Yhat have a really good blog, mostly guest bloggers doing really cool things with data. Yhat is also the author of the Python port of ggplot (it's pretty remarkable that it was done at all).
Selected articles from Yhat
Honourable Articles
- PIN Analysis by DataGenetics. This article is probably the most fun I've had reading a data article: great dataset, great insights, and great visuals. <3 those heatmaps! This analysis has inspired some of my own work on analysing passwords.
- High performance database joins with pandas DataFrame, more benchmarks by Wes McKinney. Basically, pandas > SQLite3 >> R.
- More Is Always Better: The Power Of Simple Ensembles is a great post on machine learning and simple averaging from a now-defunct blog.