Working in progressive data is great. We have no shortage of meaningful questions to answer, organizers to empower, and fun experiments to run.
But before any of that can happen, we usually have to find and clean some sort of gnarly public data. And that's where this job can get frustrating.
Because there's no single, comprehensive clearinghouse for data on campaigns and elections, starting a new data project often means digging deep into a county elections website to find shapefiles or CSVs that will be structured in some unpredictable way. A surprising amount of the time, the data you need isn't even digitized.
At Deck, we've been there too. Over the last six years, we've written hundreds of scripts to scrape, process, and consolidate terabytes of elections data into one big relational database. Recently, we had a revelation: why aren't we sharing this with people?
So... now we are!
We're calling this new offering Hubble. It's a data subscription service that gives you access to everything we know about campaigns and elections from 2008 to today. This includes candidate filings, campaign finance reports, summaries of media coverage, district boundaries, election results, and lots more.
What we're sharing
The data we're sharing falls into the following groups:
The foundation of Hubble is a set of "entity" definitions. In Hubble, an entity is a person, place, event, organization, or thing that has a distinct role in our electoral system. These tables provide definitional information about campaign committees, political geographies, candidates, elections, and more.
Every day, our scrapers search dozens of campaign finance portals at the state, county, and municipal levels to gather the latest campaign finance reports and itemized contribution records. We then clean that data, match it to our campaign graph, and link contributions to the voter file.
We work with several vendors — including Aylien and Critical Mention — to gather raw media content, from news articles to TV news transcripts. When new data comes in, our system identifies mentions of the candidates we're tracking. We then summarize what that coverage tells us about each campaign. That includes how much coverage the candidate is getting, the sentiment of the coverage, the types of media outlets providing coverage, the traits of those seeing the coverage, and more.
We've also worked to identify each candidate's political traits, demographic traits, and apparent ideology. One of the main sources we use for this work is VoteSmart, which catalogues endorsements and issue ratings made by thousands of PACs across the country.
Election results are the backbone of our candidate support predictions. We gather and process results at the district, precinct, and census block level. This data comes from Open Elections, the MIT Election Lab, Statewide Database, the Harvard Dataverse, and dozens of state and county election administrators.
Finally, we are also sharing the predictions we've generated using all of this data. That includes the district-level outcomes we're forecasting (updated every day) and the full suite of modeled scores we generate for each eligible voter.
How to get started
You can access Hubble as a shared BigQuery dataset or through a Google Cloud Storage bucket. All of the above is available for $5,000 per month with no long-term commitment.
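If you go the BigQuery route, a first look at a table might go something like the sketch below. It is only an illustration: the project, dataset, and table names are placeholders we made up for this example, not Hubble's actual schema, and it assumes you have the google-cloud-bigquery Python package installed and Google Cloud credentials configured.

```python
import re

# All three identifiers are hypothetical placeholders, not Hubble's real names.
PROJECT = "your-shared-project"
DATASET = "hubble"
TABLE = "election_results"

# Table identifiers cannot be passed as query parameters, so validate them
# before interpolating them into the SQL string.
_IDENT = re.compile(r"^[A-Za-z0-9_\-]+$")


def build_query(project, dataset, table, limit=10):
    """Return a SQL string that previews a few rows from one table."""
    for part in (project, dataset, table):
        if not _IDENT.match(part):
            raise ValueError(f"unexpected identifier: {part!r}")
    return f"SELECT * FROM `{project}.{dataset}.{table}` LIMIT {int(limit)}"


def preview_table(project=PROJECT, dataset=DATASET, table=TABLE, limit=10):
    """Run the preview query and return the rows as a list of dicts.

    Requires the google-cloud-bigquery package and working credentials.
    """
    # Imported lazily so build_query can be used without the package installed.
    from google.cloud import bigquery

    client = bigquery.Client()
    rows = client.query(build_query(project, dataset, table, limit)).result()
    return [dict(row.items()) for row in rows]
```

Calling preview_table() with the real names from your subscription should return the first few rows of that table, which is usually enough to get a feel for the columns before writing anything more involved.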
If you want to get started now, or if you have any questions or feedback, please let us know! We're very eager to get this data in more hands and continue to expand it and improve its quality.