How we integrate multiple data sets in healthcare

In our last post, I discussed the importance of integrating multiple data sets in order to extract the best value from healthcare data. This week our Chief Data Scientist takes a closer look at how we do that.

Since the 2009 passage of the Health Information Technology for Economic and Clinical Health Act (HITECH), the use of Electronic Health Record (EHR) systems has proliferated dramatically. As a result, the following has occurred:

  • Multiple EHR system vendors have deployed mature enterprise-level systems (Cerner, Epic, etc.)
  • The amount of healthcare data collected has increased exponentially year-over-year
  • Disparate platforms have been transformed into healthcare data collection devices (smartphones, watches, etc.)
  • The data broker industry (buyers and sellers of healthcare data) has expanded

As terabytes upon petabytes of data are collected, we must ask: “how can we extract true value from these data?” 

Though there is much to be gleaned from one database, the analytics become truly meaningful, and more powerful, when multiple data sets are analyzed in aggregate. That said, transferring the knowledge acquired in one database to another is inherently difficult. Typically, the data in different databases are collected under different circumstances. Even if certain standards are employed (such as uniformly accepted codes for demographic data, diseases and medical drugs), the individual data elements (also called variables or features) of the databases frequently don’t match. One type of information (e.g. history of smoking) may be present in one database and absent from another.

Traditional data analysis often discards reams of valuable information before the analytics can even start. This is done to make the features uniform across the databases. More sophisticated analyses can capture the full breadth and depth of multiple data sets, despite differing features.

Consider two databases: one of claims and one of EHR records. The claims database might contain 1,000 features, and the EHR database might contain 70. In this example, the claims data set covers a much broader set of features than the EHR database.

A traditional decision rule constructed for the claims data would be useless in the EHR data, because it relies on many features that the EHR data does not contain. The traditional approach would therefore first remove all features not shared by the two data sets and restrict the rule to the common ones.
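
To make the cost of that traditional step concrete, here is a minimal Python sketch of restricting two data sets to their shared features. The feature names and records are hypothetical and purely illustrative; they are not drawn from any real claims or EHR database, and this is only the common-feature baseline, not the LUPI+ technique.

```python
# Hypothetical records: each dict maps a feature name to a value.
claims_records = [
    {"age": 54, "sex": "F", "diagnosis_code": "E11",
     "smoking_history": True, "total_paid": 1250.0},
    {"age": 61, "sex": "M", "diagnosis_code": "I10",
     "smoking_history": False, "total_paid": 310.5},
]
ehr_records = [
    {"age": 47, "sex": "M", "diagnosis_code": "E11", "hba1c": 7.9},
]


def feature_names(records):
    """Return the feature names that appear in every record of a data set."""
    names = set(records[0])
    for record in records[1:]:
        names &= set(record)
    return names


def restrict_to(records, features):
    """Keep only the given features in each record, dropping everything else."""
    return [{name: record[name] for name in sorted(features)} for record in records]


# Features present in both data sets: the only ones the traditional rule may use.
common = feature_names(claims_records) & feature_names(ehr_records)

# Features thrown away from each side before any analysis starts.
discarded_claims = feature_names(claims_records) - common
discarded_ehr = feature_names(ehr_records) - common

claims_common = restrict_to(claims_records, common)
ehr_common = restrict_to(ehr_records, common)

print("common features:", sorted(common))
print("discarded from claims:", sorted(discarded_claims))
print("discarded from EHR:", sorted(discarded_ehr))
```

In this toy case the smoking history and payment amount are dropped from the claims side, and the lab value is dropped from the EHR side, before any rule is built. With 1,000 claims features and 70 EHR features, the discarded portion would be far larger.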

The new LUPI+ paradigm that we have been developing can draw on the full set of 1,000 claims features and construct decision rules that also incorporate the 70 EHR features. Preliminary application of this technique has demonstrated a 20% improvement in error rate.

So how are we applying these techniques in healthcare analytics today?

We are exploiting the longitudinal patient histories contained in national claims databases, and combining them with EHR systems. We can combine the breadth of national claims with the depth of geographically localized hospital records to gain greater insight into the health characteristics of a patient population.

Using these techniques, we are now able not only to extract value from a single database, but also to augment it and fill in the gaps using other data sets. We can extract more value from a collection of databases than from any single one, and as a result tackle far more complicated medical questions, all while maintaining patients’ privacy.

Oodaye Shukla
Chief Data Scientist