Classifiers made simple

As the only physician in a company of 4,000+ employees, I have the fun of teaching my colleagues about medicine. I help decide which diseases we should target, I interface with clients (pharma, CROs, etc.), and I make sure my teammates – who are mostly engineers – understand the fundamentals of the diseases we are discussing.

Conversely, I have no background in analytics. I can tell the team whether the results we obtain make sense from a medical perspective, but I certainly can’t write the scripts required to query the data directly. I must stop our staff meetings with a polite “what was that?” three or four times an hour. So they are teaching me too. (The best job in the world – I teach and learn every day!)

I have been thinking that there must be other people out there like me: people who work in or around analytics but don’t come from that background. So today’s post describes the way I have made sense of what we do. At Vencore, we are searching mountainous databases for people with rare diseases. It is the ultimate needle-in-a-haystack problem.

When we start, we have over 110 million patients in just one of our databases. How can we determine who has a rare disease, especially when there is so much other health information that won’t be relevant? How do we filter out all of the hypertension, heart disease, and diabetes so that we can see the features of one particular rare disease?

Picture this: you are in a subway terminal and everyone in the crowd is wearing blue pants. How do you tell people apart? You realize that some are wearing black shirts, and others are in white. Now it is simple to divide what you see. The throngs waiting for their trains can be divided into those in black and those in white.

This is essentially what we do. We teach machines to ignore the blue pants (common health conditions) and look for the color of the shirts (the features that we want to find). Because everyone wears blue pants, pants color tells us nothing about who is who. The discriminating feature that we would select in this case is 'shirt color' and not 'pants color'.
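For readers who like to see an idea in code, here is a minimal sketch of the subway example in Python. It is only an illustration: the `crowd` data and the `discriminating_features` helper are made up for this post and are not part of any real pipeline. It simply keeps the features that take more than one value across the crowd, since a feature that is the same for everyone cannot tell people apart.

```python
# Toy crowd (made-up data): everyone wears blue pants, shirts vary.
crowd = [
    {"pants": "blue", "shirt": "black"},
    {"pants": "blue", "shirt": "white"},
    {"pants": "blue", "shirt": "black"},
    {"pants": "blue", "shirt": "white"},
]


def discriminating_features(people):
    """Return the features that take more than one value across the group.

    Hypothetical helper for illustration: a feature with a single value
    (blue pants for everyone) cannot separate anyone from anyone else.
    """
    features = people[0].keys()
    return [f for f in features if len({p[f] for p in people}) > 1]


print(discriminating_features(crowd))  # ['shirt']
```

Real health data is far messier than this, so actual feature selection presumably relies on statistical measures rather than a simple "more than one value" check.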

Once we determine which features distinguish patient from not-patient, we build a classifier. A classifier is a mathematical model that contains a description of what a patient with a disease looks like and a description of what a patient without a disease looks like. The function of a classifier is to separate the population into those who have the disease but are not diagnosed and those who do not have the disease. Black shirts and white shirts.
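To make "a description of each group" concrete, here is a small sketch of one of the simplest possible classifiers, a nearest-centroid classifier, in plain Python. This is an assumption chosen for illustration, not necessarily the method our team uses. The feature names, the four training rows, and the helper functions `fit_centroids` and `classify` are all hypothetical. The "description" of each group is just the average of its members' features, and a new person is assigned to whichever description they sit closest to.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def fit_centroids(rows, labels):
    """Average the feature vectors of each class into one 'description' per class."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        if label not in sums:
            sums[label] = [0.0] * len(row)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], row)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def classify(centroids, row):
    """Assign the row to the class whose description (centroid) is nearest."""
    return min(centroids, key=lambda label: dist(centroids[label], row))


# Made-up features: [has_symptom_a, has_symptom_b, has_hypertension]
# Hypertension plays the role of the blue pants: equally common in both groups.
train_rows = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
train_labels = ["disease", "disease", "no disease", "no disease"]

centroids = fit_centroids(train_rows, train_labels)
print(classify(centroids, [1, 1, 1]))  # disease
print(classify(centroids, [0, 0, 1]))  # no disease
```

Notice that the hypertension column averages out to the same value in both descriptions, so it has no influence on the decision. Only the two symptom columns, the "shirt color" of this toy example, separate the groups.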

There are many interim steps that make sure the classifier is not tainted or skewed. It is an iterative process that ensures that the feature set is mathematically sound. Thankfully I don’t have to do that math. I am so grateful to work with brilliant people who don’t – for reasons I will never understand – want to hide under their desks when these calculations are required. We also make sure the feature set is clinically sound: Do our results make sense? Will they help identify patients with rare diseases? By thinking mathematically and clinically in developing our analytics approach, we are pairing state-of-the-art analytics with a mission. But that is the topic for another post…

Tara Grabowsky, MD
Chief Medical Officer