In order to find undiagnosed patients with a rare disease, you have to start with a large population. For a disease with a prevalence of 1 in 50,000, there would only be about 750 patients in the entire state of California. That means the datasets we typically work with have to contain the healthcare records of many millions of unique patients, and after taking into account all of the healthcare-related data for each patient, these datasets can grow to terabytes. We have to comb through the claims records of hundreds of millions of people to find the pattern of events that distinguishes a patient suffering from a rare disease from the broader population. To provide relevant and actionable insights into patient populations, this process must also be fast enough to respond to the follow-up questions clients ask when they see the initial results.
The key to how we do this is data curation. We have analyzed a number of rare diseases and have developed an efficient processing workflow to answer our clients’ questions. Instead of starting from scratch with every question, we are able to reuse data structures that we have optimized for finding patients. Our workflow is implemented on the Hadoop platform, which lets us scale both the amount of data we process and the types of algorithms we apply. As a simple example, we may create a table of how many times each patient received every possible diagnosis code. From that table we can quickly extract a population of “gold standard” patients (known patients with a particular disease, such as multiple sclerosis). We typically work with the client to refine the gold-standard definition of their patient of interest: for example, patients with multiple sclerosis but not optic neuritis.
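The shape of that curated structure can be sketched in a few lines. This is an illustrative in-memory stand-in (at production scale the table lives in Hadoop), and the specific ICD-9 codes, claim thresholds, and the `gold_standard` helper are hypothetical choices, not our actual cohort definition:

```python
from collections import Counter, defaultdict

# Toy claims records as (patient_id, diagnosis_code) pairs. In practice this
# is a table over hundreds of millions of patients; the structure is the point.
claims = [
    ("p1", "340"),     # 340 = multiple sclerosis (ICD-9)
    ("p1", "340"),
    ("p1", "401.9"),   # 401.9 = hypertension
    ("p2", "340"),
    ("p2", "377.30"),  # 377.30 = optic neuritis
    ("p3", "401.9"),
]

# Step 1: curate a patient x diagnosis-code count table.
code_counts = defaultdict(Counter)
for patient, code in claims:
    code_counts[patient][code] += 1

# Step 2: extract a "gold standard" cohort, e.g. patients with at least two
# MS claims and no optic neuritis claim (a hypothetical refinement).
def gold_standard(counts, require="340", min_count=2, exclude="377.30"):
    return {
        patient for patient, codes in counts.items()
        if codes[require] >= min_count and codes[exclude] == 0
    }

print(sorted(gold_standard(code_counts)))  # ['p1']
```

Because the count table is built once, each refinement of the cohort definition is just a cheap filter over it rather than another pass over the raw claims.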
We can also apply the same workflow to determine the top diagnosis codes that patients in a given population receive. We look at which codes occur most frequently, and we can compare two populations (a target and a control) in order to filter out irrelevant codes that are common in both sets, such as hypertension. This approach is faster and more accurate than manually reading ICD-9 code descriptions and making assumptions about which ones doctors would use: we let the data tell us how doctors actually code, what patients actually experience, and which features are relevant. Once we know that, we can train models that pull similar patients (who might not be diagnosed yet) out of the dataset. The data can talk to us because we take the time up front to process it into a form that we can easily query.
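One simple way to realize that target-versus-control comparison is to rank codes by how much more prevalent they are in the target population. The populations, codes, and the ratio-based score below are illustrative assumptions, not our production scoring method:

```python
from collections import Counter

def code_prevalence(population):
    """population: list of per-patient sets of diagnosis codes.
    Returns the fraction of patients with at least one claim for each code."""
    n = len(population)
    counts = Counter(code for codes in population for code in codes)
    return {code: count / n for code, count in counts.items()}

# Hypothetical populations: a small MS target cohort and a general control.
target = [{"340", "401.9", "344.1"}, {"340", "401.9"}, {"340", "728.87"}]
control = [{"401.9"}, {"401.9", "250.00"}, {"728.87"}, {"401.9"}]

t, c = code_prevalence(target), code_prevalence(control)

# Score each target code by its prevalence ratio, smoothing codes unseen in
# the control set. Codes common in both populations (like 401.9, hypertension)
# score near 1 and drop to the bottom of the ranking.
floor = 1 / (len(control) + 1)
scores = {code: t[code] / c.get(code, floor) for code in t}
for code, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(code, round(score, 2))
```

The surviving high-scoring codes become candidate features for the models that search the broader dataset for similar, not-yet-diagnosed patients.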