(originally posted on camdp.com)
Preface: This example is a (greatly modified) excerpt from the open-source book Bayesian Methods for Hackers, currently being developed on Github ;)
Why is sorting from "best" to "worst" so difficult?
Consider ratings on online products: how often do you trust an average 5-star rating if there is only 1 reviewer? 2 reviewers? 3 reviewers? We implicitly understand that with such few reviewers that the average rating is not a good reflection of the true value of the product.
This has created flaws in how we sort items. Many people have realized that sorting online search results by their rating, whether the objects be books, videos, or online comments, return poor results. Often the seemingly top videos or comments have perfect ratings only from a few enthusiastic fans, and truly more quality videos or comments are hidden in later pages with falsely-substandard ratings of around 4.8. How can we correct this?
Consider the popular site Reddit (purposefully did not link to the website as you would never come back). The site hosts links to stories or images, and a very popular part of the site are the comments associated with each link. Redditors can vote up or down on each comment (called upvotes and downvotes). Reddit, by default, will sort comments by Top, that is, the best comments. How would you determine which comments are the best? There are a number of ways to achieve this:
- Popularity: A comment is considered good if it has many upvotes. A problem with this model is to consider a comment with hundreds of upvotes, but thousands of downvotes. While being very popular, the comment is likely more controversial than best.
- Difference: Using the difference of upvotes and downvotes. This solves the above problem, but fails when we consider the temporal nature of comments. Comments can be posted many hours after the original link submission. The difference method will bias the Top comments to be the oldest comments, which have accumulated more upvotes than newer comments, but are not necessarily the best.
- Time adjusted: Consider using Difference divided by the age of the comment. This creates a rate, something like difference per second, or per minute. An immediate counter example is, if we use per second, a 1 second old comment with 1 upvote would be better than a 100 second old comment with 99 upvotes. One can avoid this by only considering at least \(t\) second old comments. But what is a good \(t\) value? Does this mean no comment younger than \(t\) is good? We end up comparing unstable quantities with stable quantities (young vs. old comments).
- Ratio: Rank comments by the ratio of upvotes to total number of votes (upvotes plus downvotes). This solves the temporal issue, such that new comments who score well can be considered Top just as likely as older comments, provided they have many upvotes to total votes. The problem here is that a comment with a single upvote (ratio = 1.0) will beat a comment with 999 upvotes and 1 downvote (ratio = 0.999), but clearly the latter comment is more likely to be better.
I used the phrase more likely for good reason. It is possible that the former comment, with a single upvote, is in fact a better comment than the later with 999 upvotes. The hesitation to agree with this is because we have not seen the other 999 potential votes the former comment might get. Perhaps it will achieve an additional 999 upvotes and 0 downvotes and be considered better than the latter, though not likely.
What we really want is an estimate of the true upvote ratio. Note that the true upvote ratio is not the same as the observed upvote ratio: the true upvote ratio is hidden, and we only observe upvotes vs. downvotes (one can think of the true upvote ratio as "what is the underlying probability someone gives this comment a upvote, versus a downvote"). So the 999 upvote/1 downvote comment probably has a true upvote ratio close to 1, which we can assert with confidence thanks to the Law of Large Numbers, but on the other hand we are much less certain about the true upvote ratio of the comment with only a single upvote. Sounds like a Bayesian problem to me.
What is our prior distribution though? (Prior? If you are unfamiliar with Bayesian methods, you should totally read the book, or check out this quick intro). One way to determine a prior on the upvote ratio is that look at the historical distribution of upvote ratios. This can be accomplished by scrapping Reddit's comments and determining a distribution. There are a few problems with this idea though:
- Skewed data: The vast majority of comments have very few votes, hence there will be many comments with ratios near the extremes of 1.0 and 0, effectively skewing our distribution to the extremes.
- Biased data: Reddit is composed of different subpages, called subreddits. Two examples are r/aww, which posts pics of cute animals, and r/politics. It is very likely that the user behaviour towards comments of these two subreddits are very different: visitors are likely friendly and affectionate in the former, and would therefore upvote comments more, compared to the latter, where comments are likely to be controversial and disagreed upon. Therefore not all comments are the same.
In light of these, I think it is better to use a
Uniform prior for the true upvote ratio.
Below is an image that was trending Reddit last weekend. We have scrapped all the comments, plus their upvotes and downvotes. Editor's Note: The book, Bayesian Methods for Hackers, scrapes Reddit in real-time, and returns the currently most popular image + comments.
Title of submission: The Golden Tortoise Beatle. http://i.imgur.com/6F7JytV.jpg
Some Comments (out of 69 total) ----------- "That's badass as fuck" upvotes/downvotes: [34 4] "What kind of dog is this? " upvotes/downvotes: [407 39] "http://i.imgur.com/8dMVsoQ.jpg" upvotes/downvotes: [2 0] "I'm ready for the perpetual downvotes into extreme negative karma, but it's "beetle"" upvotes/downvotes: [13 2]
For a given true upvote ratio \(p\) and \(N\) total votes, the number of upvotes will look like a Binomial random variable with parameters \(p\) and \(N\). (This is because of the equiviliance between upvote ratio and probability of upvoting versus downvoting, out of \(N\) possible votes/trials). We create a function that performs Bayesian inference on \(p\), for a particular comment's upvote/downvote pair. We are going to use PyMC to determine the posterior distribution of the true upvote ratio for each comment:
import pymc as mc def posterior_upvote_ratio( upvotes, downvotes, samples = 20000): N = upvotes + downvotes upvote_ratio = mc.Uniform( "upvote_ratio", 0, 1 ) observations = mc.Binomial( "obs", N, upvote_ratio, value = upvotes, observed = True) model = mc.Model( [upvote_ratio, observations ]) map_ = mc.MAP(model).fit() mcmc = mc.MCMC(model) mcmc.sample(samples, samples/4) return mcmc.trace("upvote_ratio")[:]
Below are the resulting posterior distributions.