Finding the "right" big data - the first challenge in meaningful analytics

We are inundated with talk of “Big Data” in our blogs, tweets and newspapers. Is that only because it is a buzzword? Or are people actually finding meaningful ways to (1) collect, (2) integrate, and (3) analyze the exabytes of data out there?

We will spend our next few blog posts addressing this question, and looking at each of those three components.

In the defense industry, we have been using big data for decades, far longer than the term “big data” has been in use. The healthcare sector is still early in the process of extracting real value from its mountains of data. That is becoming more feasible every day as the quality of available information improves and the quantity grows.

We are successfully developing the tools to ingest, integrate and analyze this data. But as anyone who works in this field knows, data acquisition, the first step in valuable data analysis, is not all that simple. Finding the right datasets for a specific challenge can be a daunting task. So where do we start?


We are finding great value in free data! Getting our hands on public data and understanding it shows us how to aggregate disparate datasets the right way and which private datasets to target. Here is a list of public datasets we’ve reviewed:

  • CMS Open Payments Database
  • Dartmouth Atlas of Healthcare
  • openFDA Adverse Events Reports
  • County Health Rankings
  • AACT Database
  • Medical Imagery Data from the Osteoarthritis Initiative
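
To make "aggregating disparate datasets" a little more concrete, here is a minimal sketch of one common step: joining two sources on a shared geographic key. Everything in it is an assumption made for illustration. The rows are made up, the field names (fips, uninsured_pct, amount_usd) are hypothetical and do not reflect the real schemas of any dataset listed above, and it assumes both sources carry a county code at all.

```python
from collections import defaultdict

# Made-up rows standing in for two public sources.
# Field names and values are illustrative only, not real schemas or data.
county_health = [
    {"fips": "01001", "county": "Example County A", "uninsured_pct": 8.5},
    {"fips": "01003", "county": "Example County B", "uninsured_pct": 10.1},
]
payments = [
    {"fips": 1001, "amount_usd": 120.0},     # code stored as an int, zero lost
    {"fips": "01001", "amount_usd": 80.0},
    {"fips": "01005", "amount_usd": 45.0},   # no matching county row
]


def normalize_fips(value):
    """Return the county code as a 5-character, zero-padded string."""
    return str(value).strip().zfill(5)


def join_on_fips(health_rows, payment_rows):
    """Left-join per-county payment totals onto the county health rows."""
    totals = defaultdict(float)
    for row in payment_rows:
        totals[normalize_fips(row["fips"])] += row["amount_usd"]

    joined = []
    for row in health_rows:
        key = normalize_fips(row["fips"])
        # Counties with no payment rows get a total of 0.0.
        joined.append(dict(row, fips=key, total_payments_usd=totals.get(key, 0.0)))
    return joined


for record in join_on_fips(county_health, payments):
    print(record)
```

The key normalization matters more than the join itself: the same county code often shows up as an integer in one file and a zero-padded string in another, and a naive join silently drops those rows.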

Business Relationships:

We are dedicated to finding patients with rare diseases who are undiagnosed. This is the proverbial needle-in-a-haystack problem. To do this, we need vast quantities of high-quality data. Toward that end, we are working with the following acquired datasets:

  • Claims
  • Electronic Health Records (including notes, lab results, radiology results, etc.)
  • Consumer Demographics
  • Databases containing only clinical notes

It is crucial to make sure the available dataset doesn’t drive the question. Rather, the question should drive the choice of data.

If we don’t start with the right raw material, we will never build a good product.

Sean O’Neil
Data Scientist