Motivation

As part of a homework assignment for my data science class (JHU 140.712 Advanced Data Science), we predicted the Senate results in each state from US opinion polling data (and any other data sources we chose). Three competitions were held for the class:

  1. Predict the number of Republican senators. You may provide an interval. The smallest interval that includes the election-day result wins.
  2. Predict the Republican-Democrat (R-D) difference in each state. The prediction that minimizes the residual sum of squares between predicted and observed differences wins.
  3. Report a confidence interval for the R-D difference in each state. If the election-day result falls outside your confidence interval in more than two states, you are eliminated. For those surviving this cutoff, we sum the lengths of all confidence intervals; the smallest total length wins.
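
To make the three scoring rules concrete, here is a minimal sketch in plain Python. The function names and argument layout are my own illustration, not part of the assignment.

```python
# Hypothetical scoring helpers for the three class competitions.

def score_competition_1(interval, observed_seats):
    """Interval for the number of Republican senators: it must cover the
    election-day result, and among intervals that do, the narrowest wins."""
    lo, hi = interval
    return (lo <= observed_seats <= hi), hi - lo

def score_competition_2(predicted_diffs, observed_diffs):
    """Residual sum of squares between predicted and observed R-D differences."""
    return sum((p - o) ** 2 for p, o in zip(predicted_diffs, observed_diffs))

def score_competition_3(lowers, uppers, observed_diffs, max_misses=2):
    """Total confidence-interval length, provided the result falls outside the
    interval in at most two states; otherwise the entry is eliminated."""
    misses = sum(not (lo <= o <= hi)
                 for lo, hi, o in zip(lowers, uppers, observed_diffs))
    eliminated = misses > max_misses
    return eliminated, sum(hi - lo for lo, hi in zip(lowers, uppers))
```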

In this document, I summarize the steps I took to build my model and my results.

Data

I downloaded or scraped data from:

  1. RealClearPolitics, including their historical polling data for 2012, 2014, and 2016
  2. The summary table FiveThirtyEight used in their Senate model (available in CSV format)
  3. Pollster ratings from FiveThirtyEight’s GitHub repository

Analysis

I adjusted for three effects in a Bayesian model:

  1. Pollster effect. Take into account both the pollster’s reputation and its historical bias. The specific steps I took to adjust for the pollster effect are:
  • Eliminate polls from pollsters banned by 538. I found none in the data, which is good.
  • Adjust the reported numbers using the mean-reverted bias, which represents a pollster’s historical lean toward Democratic or Republican candidates. For polls I could not match to the 538 pollster ratings, I left the reported differences unadjusted. This adjustment changed the predictions very little, likely because most pollsters do not have large biases.
  • Weight using the 538 rating. I converted the letter grade 538 assigns to each pollster, which reflects its reputability, to a numeric value. Pollsters I could not match to 538’s database were assigned a grade of “C”. This is a somewhat arbitrary choice: it balances the assumption that reputable pollsters are likely to appear in the database against the possibility that a pollster was in the database but I failed to match it.
  2. Time effect (recency). Polls closer to the election date are considered more accurate and are weighted more heavily.
  3. Sample size. The larger a poll’s sample size, the more precise its estimate. This falls out naturally from the Bayesian model: I set weakly informative priors on the proportions voting Democratic and Republican, update them with each poll to obtain posterior distributions, and use the posterior mean as the adjusted estimator. A sketch combining all three adjustments follows this list.
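
Below is a minimal Python sketch of the update. The data layout, the grade-to-weight mapping, and the recency half-life are my own illustrative assumptions, not the exact values I used; for brevity the sketch also collapses the two priors into a single Beta prior on the Republican share of the two-party vote.

```python
# A minimal sketch of the three adjustments, assuming each poll record carries
# a Republican share, a Democratic share, a sample size, the days remaining
# until the election, a 538 letter grade ("C" if unmatched), and a
# mean-reverted bias on the R-D margin (0 if unmatched).
from dataclasses import dataclass

GRADE_WEIGHT = {"A": 1.0, "B": 0.75, "C": 0.5, "D": 0.25}  # assumed mapping

@dataclass
class Poll:
    rep_share: float        # reported proportion for the Republican candidate
    dem_share: float        # reported proportion for the Democratic candidate
    sample_size: int
    days_to_election: int
    grade: str              # 538 letter grade; "C" if the pollster is unmatched
    bias: float = 0.0       # mean-reverted bias on the R-D margin; 0 if unmatched

def recency_weight(days_to_election, half_life=14.0):
    """Exponential decay so that polls closer to election day count more."""
    return 0.5 ** (days_to_election / half_life)

def posterior_mean_margin(polls, prior_alpha=1.0, prior_beta=1.0):
    """Update a weakly informative Beta prior on the Republican share of the
    two-party vote, one poll at a time, and return the posterior-mean R-D margin."""
    alpha, beta = prior_alpha, prior_beta
    for p in polls:
        # Remove the pollster's mean-reverted bias from the reported margin.
        margin = (p.rep_share - p.dem_share) - p.bias
        two_party = p.rep_share + p.dem_share
        rep_two_party = 0.5 + margin / (2 * two_party)
        # Down-weight the effective sample size by pollster grade and recency.
        w = GRADE_WEIGHT.get(p.grade[0], 0.5) * recency_weight(p.days_to_election)
        n_eff = w * p.sample_size * two_party
        alpha += n_eff * rep_two_party
        beta += n_eff * (1 - rep_two_party)
    rep_mean = alpha / (alpha + beta)
    return 2 * rep_mean - 1  # posterior-mean R-D difference on the two-party scale
```

Folding the pollster and recency weights into the effective sample size is one simple way to let better and fresher polls move the posterior more, while the sample-size effect comes for free from the Beta-binomial update.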

Of the three adjustments, only the time effect had a significant impact on the predictions relative to the naive model (where I use a simple mean of the polls as the prediction).

Results

Below are plots of my predicted Republican-Democrat vote differences for 2012, 2014, and 2016, along with the true election results. The blue circles are the adjusted estimators (with 95% credible intervals), the gray circles are the unadjusted naive estimators, and the colored diamonds are the true election results. Note that my intervals fail to capture the true result in many cases; I will account for this when submitting my entry for Competition 3.