We have talked a lot about our ability to find patients with rare (or hard-to-diagnose) diseases before they are diagnosed by their physicians. But have we mentioned that just one of the databases we use has 148 million unique lives? When you consider that each of those unique lives has thousands of columns of data, you can see that the data analysis gets complex fast.
Before I started working in healthcare analytics, I would not have realized just how complex that is. This week Sean O’Neil, one of our data scientists, walks us through distributed processing, which enables us to scale our analytics to this level.
Tara Grabowsky, MD
Chief Medical Officer
So how can we manage so many data elements?
The answer is: Hadoop.
You’ve heard about how we identify the right data sources (claims, EHR, etc.) to find these patients, and how we integrate those disparate sources to create a more holistic view of an anonymous patient’s history in the healthcare system. Hadoop, a distributed computing framework, is what makes it possible for us to run our analyses on millions and millions of data elements.
Without distributed computing, Vencore Health wouldn’t be able to process all the data we’re constantly analyzing, let alone extract valuable insights.
Without distributed computing, one advanced query that we execute daily on these data sources would take weeks to months!
Without distributed computing, we would not be able to take our raw material and convert it into life-changing results.
In other words, Hadoop represents the “machines” that work tirelessly in our “factory” around the clock, seven days a week.
For those not up to speed on this open-source software framework (you’re not the only one), let me tell you some of the basics of Hadoop. Hadoop has revolutionized how we store and process data. HDFS (the Hadoop Distributed File System) can keep your data on an individual machine or spread it across a cluster of machines connected together, hence “distributed” file system. Because the data is split into blocks and spread across the cluster, no single machine has to hold (or have room for) the entire data set.
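To make the “distributed” idea concrete, here is a toy sketch in Python of how a system like HDFS splits a file into fixed-size blocks and places copies of each block on different machines. The block size, node names, and replication factor below are illustrative stand-ins, not real HDFS settings (real HDFS defaults to 128 MB blocks and three replicas):

```python
# Toy sketch: split data into fixed-size blocks and assign each block
# to several distinct "nodes", round-robin, the way a distributed file
# system spreads storage (and copies, for safety) across a cluster.

BLOCK_SIZE = 16          # bytes per block (illustrative; HDFS uses ~128 MB)
REPLICATION = 2          # how many copies of each block to keep
NODES = ["node-1", "node-2", "node-3"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Chop a byte string into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes=NODES, replication=REPLICATION):
    """Assign each block to `replication` distinct nodes, round-robin."""
    return {
        i: [nodes[(i + r) % len(nodes)] for r in range(replication)]
        for i in range(len(blocks))
    }

data = b"patient records spread across a cluster of machines"
blocks = split_into_blocks(data)
placement = place_blocks(blocks)
for i, where in placement.items():
    print(f"block {i} ({len(blocks[i])} bytes) -> {where}")
```

Because every block lives on more than one node, the cluster can keep serving data even when an individual machine fails.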
MapReduce enables us to slice and dice data into smaller chunks and then run analyses in parallel. What do I mean by parallel? MapReduce splits your data across separate nodes and instructs each computer in your cluster to carry out its part of the task you define (the map step), and then the results of those tasks are combined to give you a single output (the reduce step). Essentially, this lets us run analyses on enormous data sets without overloading any single machine. Overall, Hadoop removes the bottlenecks that would otherwise prevent big data storage and processing and enables us to answer very big questions with very big data sets.
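The map and reduce steps described above can be sketched in a few lines of plain Python. This is not Hadoop code — it is a minimal word-count example where a thread pool stands in for the cluster: each “node” counts words in its own chunk (map), and the partial counts are merged into one total (reduce).

```python
# Minimal MapReduce-style word count: map in parallel, then reduce.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_step(chunk: str) -> Counter:
    """Each 'node' counts the words in its own slice of the data."""
    return Counter(chunk.split())

def reduce_step(a: Counter, b: Counter) -> Counter:
    """Combine the partial counts from the nodes into one result."""
    return a + b

# Three chunks standing in for data spread across three nodes.
chunks = [
    "claims data claims data",
    "ehr data ehr notes",
    "claims notes data",
]

with ThreadPoolExecutor() as pool:
    partial_counts = list(pool.map(map_step, chunks))    # map: in parallel

totals = reduce(reduce_step, partial_counts, Counter())  # reduce: merge
print(totals.most_common(2))
```

The same shape scales up: with real Hadoop, the chunks are HDFS blocks, the map tasks run on many machines at once, and the framework handles shuffling the partial results to the reducers.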