Predictive Modeling and Machine Learning: Sharpening the Focus of Real-World Evidence in the Age of Precision Medicine
Real-world evidence is increasingly used to characterize the natural history of disease, drug utilization and treatment patterns, comorbidities, and the safety and effectiveness of medications, as well as healthcare utilization, cost of care, and other economic outcomes. Administrative databases are a cornerstone of real-world evidence, offering real-time data for large populations in which to study rare diseases and sensitive subgroups. Large claims databases bring unparalleled scope but can lack the clinical depth needed to address certain questions, while electronic medical record databases can have substantial missing data.
Rapid advances in precision medicine are revealing many diseases to be heterogeneous entities with different etiologies and natural histories. This understanding is opening doors for the development of more effective targeted treatments. For example, over the past decade, breast cancer therapies have been approved for, and have successfully treated, specific forms of the disease defined by molecular profiling. As targeted treatments proliferate, there is a growing need to better understand smaller subpopulations with more narrowly defined conditions of interest.
Administrative claims databases contain longitudinal information on all billed healthcare encounters, on the scale needed to study targeted indications effectively. Coding systems evolve rapidly and contain a plethora of diagnosis codes, yet they lag behind clinical information such as staging and molecular and genomic profiling. This situation presents a challenge to researchers trying to unlock the potential of automated databases to support the moving target of therapeutic advances.
Researchers have learned that patterns of care defined by combinations of diagnosis, procedure, and treatment codes in claims can identify people with health conditions of interest more accurately than is possible using diagnosis codes alone. These algorithms take the form of Boolean expressions with multiple codes included and a dichotomous outcome (if this AND not that, OR this, THEN case=yes). Usually codes are selected based on prior knowledge by clinicians with experience diagnosing the condition. Validation studies have shown that conventional algorithms often do not perform well, however, so there is a need for methodologies that can improve algorithm performance.
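As a rough illustration of the Boolean form described above, the Python sketch below applies one such rule to a handful of made-up patients and compares the result with a chart-review reference standard. The flag names (DX_BREAST_CANCER, RX_ENDOCRINE, RX_HER2_TARGETED), the rule itself, and the toy data are hypothetical placeholders, not the codes or algorithm used in any actual study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    codes: frozenset   # hypothetical flags derived from claims codes
    true_case: bool    # reference standard, e.g. from medical record review


def boolean_algorithm(codes):
    """Hypothetical rule: diagnosis AND endocrine therapy AND NOT HER2-targeted therapy."""
    return (
        "DX_BREAST_CANCER" in codes
        and "RX_ENDOCRINE" in codes
        and "RX_HER2_TARGETED" not in codes
    )


def performance(patients, classify):
    """Compare a dichotomous algorithm with the reference standard."""
    tp = fp = fn = tn = 0
    for p in patients:
        predicted = classify(p.codes)
        if predicted and p.true_case:
            tp += 1
        elif predicted and not p.true_case:
            fp += 1
        elif not predicted and p.true_case:
            fn += 1
        else:
            tn += 1
    return {
        "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Made-up validation sample: six patients with chart-confirmed status.
patients = [
    Patient(frozenset({"DX_BREAST_CANCER", "RX_ENDOCRINE"}), True),
    Patient(frozenset({"DX_BREAST_CANCER", "RX_ENDOCRINE", "RX_HER2_TARGETED"}), False),
    Patient(frozenset({"DX_BREAST_CANCER"}), True),
    Patient(frozenset({"DX_BREAST_CANCER", "RX_ENDOCRINE"}), False),
    Patient(frozenset({"DX_BREAST_CANCER", "RX_ENDOCRINE"}), True),
    Patient(frozenset(), False),
]

result = performance(patients, boolean_algorithm)
print(result)
```

Because the rule returns only yes or no, every patient who satisfies it is treated as an equally certain case, which is one reason such algorithms can perform poorly in validation.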
Formal statistical methods offer several strategies that can improve accuracy of case identification. Multivariable prediction models such as logistic regression can estimate the probability of being a case as a continuous outcome instead of a binary outcome. In addition, predictors are not simply included in the algorithm as binary variables; rather, they can have continuous coefficients that reflect the strength of their relation to the outcome. Further, instead of relying solely on prior knowledge, models can use relationships in the validation data to select predictors and estimate the degree to which they are associated with the outcome. Using the validation study data to develop the model presents a risk of overfitting, in which the model is influenced by chance associations in the sample data that do not apply outside the sample. To avoid overfitting, validation samples can be split in two so that the model is developed using a training data set and model performance is assessed using a testing data set. To reduce the loss of efficiency introduced by splitting the validation data into two data sets, resampling methods such as cross-validation can be used. Finally, machine learning methods can be used to select the predictors and estimate their coefficients.
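To make these ideas concrete, here is a minimal Python sketch of a logistic regression fitted by gradient descent and assessed with 5-fold cross-validation, using only the standard library. It is an illustration under stated assumptions, not the model used in any particular study: the two predictors (one informative, one pure noise), the toy validation sample, the small L2 penalty, and the 0.5 probability threshold are all hypothetical choices.

```python
import math
import random


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, x):
    """Predicted probability of being a case; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def fit_logistic(X, y, lr=0.5, epochs=300, l2=0.01):
    """Fit by batch gradient descent with a small L2 penalty (assumed settings)."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w[0] -= lr * grad[0] / n  # the intercept is not penalized
        for j in range(1, p + 1):
            w[j] -= lr * (grad[j] / n + l2 * w[j])
    return w


def cross_validated_accuracy(X, y, k=5, threshold=0.5, seed=0):
    """Each patient is scored by a model that never saw that patient in training."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        w = fit_logistic([X[i] for i in train], [y[i] for i in train])
        correct += sum(
            (predict_proba(w, X[i]) >= threshold) == (y[i] == 1) for i in fold
        )
    return correct / len(X)


# Toy validation sample (hypothetical): 90% of patients with the informative
# flag are true cases versus 10% without it; the noise flag is unrelated.
X, y = [], []
for noise_flag in (0, 1):
    for informative_flag, n_cases, n_noncases in ((1, 45, 5), (0, 5, 45)):
        X += [[informative_flag, noise_flag]] * (n_cases + n_noncases)
        y += [1] * n_cases + [0] * n_noncases

weights = fit_logistic(X, y)
cv_accuracy = cross_validated_accuracy(X, y)
print("intercept and coefficients:", [round(v, 2) for v in weights])
print("5-fold cross-validated accuracy:", round(cv_accuracy, 3))
```

In this sketch the informative predictor receives a large coefficient while the noise predictor stays near zero, and every patient contributes to both model development and performance assessment without being scored by a model trained on their own record.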
Breast cancer subtypes offer a case study in leveraging modern statistical methods to help overcome limitations in administrative data. The distinct epidemiologic and clinical features of different types of breast cancer cannot be identified accurately in claims data because the available diagnosis codes do not distinguish these subtypes. Historically, the absence of specific diagnosis codes for a disease has often been seen as an important or even insurmountable limitation of administrative databases for researching it. However, the distinguishing characteristics and natural histories of patients with these diseases present an opportunity to use available codes to uncover patterns that help identify patients of interest.
HealthCore has used machine learning to develop predictive model algorithms in claims, including algorithms to define early and advanced stage ER+/HER2- breast cancer. These algorithms illustrate the application of machine learning methods to identify cases with clinical characteristics (stage and biomarker status) that are not coded granularly in claims.
For the breast cancer study, we first developed a conventional algorithm by collaborating with subject matter experts with knowledge of clinical coding practice. We then used the validation study to try to improve on our conventional algorithm. We describe below a step-by-step approach:
Real-world evidence can usefully support precision medicine and faster regulatory approvals. The use of statistical methods to develop better case-identifying algorithms enables use of claims databases to study rare diseases and other targeted indications as well as more valid outcome ascertainment for safety and effectiveness research.
Further information on this work can be found in our publication of this study and in our recorded webinar with HealthEconomics.com: Identifying Medical Conditions in Administrative Claims Data: Validation and Machine Learning.
This blog was originally published on HealthEconomics.com’s blog, tHEORetically speaking.