Flexibility isn't just important for gymnasts.
In the past few weeks, we’ve discussed selecting and integrating the right data sets for a given question. Today, we’ll talk about how we use analytics to extract value from this data.
Starting with the right data is indispensable to our success – you can’t infer cost information from clinical notes, or information about pre-market drugs from insurance claims. Perhaps ironically, once we have this “big data” in hand, the first step is to perform what is known as data reduction. This doesn’t mean we arbitrarily drop swaths of data; rather, we selectively distill out the pieces that are most likely to be useful, and organize them into an efficiently accessible format. For example, we might prepare clinical notes using natural language processing (NLP), or MRI data using image recognition.
Even with well-structured data sets, such as some insurance claims data, we need to condense the information to the right level for the question at hand – do we care which medications the patient was taking, and if so, is it most appropriate to say they were on “antibiotics” or “penicillin” or “125 mg V-Cillin K”? Choosing that level is a form of dimensionality reduction, and it can be done either automatically within our predictive algorithms or as a separate preliminary step. We lean towards performing this feature extraction ahead of time so that clinical experts (our clients, our internal staff, or consultants) can review and validate the trends the computer finds against their real-world knowledge of the topic we are investigating.
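To make the medication example concrete, here is a minimal sketch of rolling a specific product up to a coarser level before any modeling happens. The lookup table, the function names (`roll_up`, `count_features`), and the three level names are assumptions made for illustration; a real project would draw on a maintained drug dictionary rather than a hand-written table.

```python
from collections import Counter

# Hypothetical lookup: specific product -> (ingredient, drug class).
# Illustrative entries only, not a real drug dictionary.
DRUG_HIERARCHY = {
    "125mg V-Cillin K": ("penicillin", "antibiotics"),
    "250mg Amoxil": ("amoxicillin", "antibiotics"),
    "20mg Lipitor": ("atorvastatin", "statins"),
}

LEVELS = ("product", "ingredient", "class")


def roll_up(medication, level):
    """Return the medication described at the requested level of detail."""
    if level not in LEVELS:
        raise ValueError(f"level must be one of {LEVELS}")
    if level == "product":
        return medication
    if medication not in DRUG_HIERARCHY:
        return "unknown"
    ingredient, drug_class = DRUG_HIERARCHY[medication]
    return ingredient if level == "ingredient" else drug_class


def count_features(medications, level):
    """Count how often each value occurs once records are rolled up."""
    return Counter(roll_up(m, level) for m in medications)


records = ["125mg V-Cillin K", "250mg Amoxil", "20mg Lipitor"]
by_class = count_features(records, "class")
by_ingredient = count_features(records, "ingredient")
```

At the "class" level the three records collapse into two features, while the "ingredient" level keeps all three distinct. Because the mapping is an explicit table rather than something buried inside a model, a clinical expert can read it and correct it directly.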
This streamlined, targeted data can tell us a lot: which co-morbidities are associated with the target disease and how frequently they occur, what complications arise from a given drug, or which providers tend to administer a certain treatment. We can also apply predictive analytics to extract additional value, such as finding patients with undiagnosed rare diseases. There are many types of so-called “machine learning” algorithms we can use for these purposes: supervised and unsupervised, linear and nonlinear, and so on. The choice is driven by the question and the data at hand.
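As a small illustration of the descriptive side, the sketch below computes how often each co-morbidity appears among patients who have a target disease. The patient records, the diagnosis labels, and the function name `comorbidity_rates` are all made up for this example; it assumes each patient has already been reduced to a set of diagnosis labels.

```python
from collections import Counter


def comorbidity_rates(patients, target):
    """Fraction of patients with `target` who also carry each other diagnosis."""
    # Keep only the patients who have the target disease.
    cohort = [dx for dx in patients.values() if target in dx]
    if not cohort:
        return {}
    # Count every other diagnosis seen within that cohort.
    counts = Counter(code for dx in cohort for code in dx if code != target)
    return {code: n / len(cohort) for code, n in counts.most_common()}


# Hypothetical records: patient id -> set of diagnosis labels.
patients = {
    "p1": {"diabetes", "hypertension"},
    "p2": {"diabetes", "hypertension", "neuropathy"},
    "p3": {"asthma"},
    "p4": {"diabetes"},
}

rates = comorbidity_rates(patients, "diabetes")
```

Here three patients have the target disease, so hypertension appears in two thirds of the cohort and neuropathy in one third. The same reduced records could later feed a supervised or unsupervised model, depending on the question.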
The common theme throughout these phases, from data reduction to feature extraction to predictive analytics, is to remain flexible so we can focus on delivering the right clinical and business value.
Senior Data Scientist