Large administrative claims databases are widely used to evaluate the real-world effectiveness and safety of medications, as well as healthcare utilization and costs. Claims databases make these analyses efficient because they contain previously collected healthcare data on large populations, with diagnoses coded using standardized clinical classifications.
Administrative data were not designed to support research, however, and getting the right answers requires recognition of their limitations. When diagnostic codes are unavailable or inaccurate, claims analyses can be uninformative or misleading. Researchers address this problem either by simply accepting the inaccuracy or, more commonly, by using combinations of codes referred to as algorithms. In addition, validation studies are conducted in which the cases identified by these algorithms are compared with a more accurate external data source, such as a medical record. Because HealthCore is able to link its data asset to external sources, we conduct many validation studies to better understand the potential limitations of claims data.
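To make the idea of a validation study concrete, the sketch below computes the accuracy measures usually reported when a claims algorithm is compared with a reference standard such as chart review. The class name, function name, and counts are hypothetical and chosen only for illustration; they do not come from any HealthCore study.

```python
from dataclasses import dataclass


@dataclass
class ValidationCounts:
    """2x2 table: claims algorithm vs. reference standard (e.g., chart review)."""
    true_pos: int   # algorithm positive, chart confirms the condition
    false_pos: int  # algorithm positive, chart does not confirm
    false_neg: int  # algorithm negative, chart confirms
    true_neg: int   # algorithm negative, chart does not confirm


def validation_metrics(c: ValidationCounts) -> dict:
    """Return standard accuracy measures for a case-identifying algorithm."""
    return {
        "sensitivity": c.true_pos / (c.true_pos + c.false_neg),
        "specificity": c.true_neg / (c.true_neg + c.false_pos),
        "ppv": c.true_pos / (c.true_pos + c.false_pos),
        "npv": c.true_neg / (c.true_neg + c.false_neg),
    }


# Hypothetical counts, for illustration only
counts = ValidationCounts(true_pos=80, false_pos=20, false_neg=40, true_neg=860)
metrics = validation_metrics(counts)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this made-up example the algorithm has a positive predictive value of 0.80 but a sensitivity of only about 0.67, so it misses a third of true cases even though most of the patients it flags are real.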
The results of validation studies show that coding algorithms are always inaccurate to some degree, depending on the code(s), the population, and other factors. Consequently, researchers are working to improve the accuracy with which we can identify cases in automated databases. One approach that is receiving increasing attention is the use of machine learning.
Machine learning combines advanced statistical methods with the results from validation studies (“supervised” learning) to construct new case-identifying algorithms. We have found that these methods markedly improve the accuracy with which we can identify patients of interest. They not only reduce bias from misclassification, which translates into more valid research, but also allow us to identify patients with characteristics for which there are no diagnosis codes in claims, such as cancer stage or biomarker status. In this way, supervised machine learning coupled with validation can expand the utility of claims databases, allowing for the accurate characterization of difficult-to-measure populations or study endpoints.
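As a rough illustration of supervised learning in this setting, the sketch below fits a small logistic regression to claims-derived features, using chart-confirmed case status as the training label. Logistic regression is a stand-in here, not necessarily the method used in practice, and the three features, the toy data, and the function names are all assumptions made for the example. A real analysis would typically rely on an established library and evaluate the model on held-out validated records.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, lr=0.3, epochs=20000):
    """Fit logistic regression by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(w, b, x):
    """Predicted probability that a patient is a true case."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))


# Hypothetical features per patient:
# [number of claims with the diagnosis code, related drug fill (0/1), specialist visit (0/1)]
X = [
    [1, 0, 0], [1, 0, 1], [2, 0, 0], [1, 1, 0],
    [3, 1, 1], [2, 1, 1], [4, 1, 0], [3, 0, 1], [5, 1, 1],
]
# Hypothetical labels standing in for chart-review results (1 = confirmed case)
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]

w, b = train_logistic(X, y)
predicted = [1 if predict_proba(w, b, xi) > 0.5 else 0 for xi in X]
print("training predictions:", predicted)
```

The fitted weights play the role of a new case-identifying algorithm: instead of a fixed rule such as "two or more diagnosis codes", each patient receives a probability that combines several signals from the claims record.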
Machine learning methods have been used in database studies conducted throughout the drug development process. Before a drug is marketed, information is needed about the target population, including its size, treatment patterns, comorbidities, health resource utilization, and cost of care. By homing in on this population more accurately, our clients are better prepared to understand unmet needs and deliver new therapies to the patients who can benefit most. At approval, regulators seek additional information on rare safety outcomes, which can be difficult to identify and for which results depend on accurately identifying small numbers of cases. Identifying too many cases can overstate the risks, while failing to identify cases means that risks are underestimated. Here, machine learning algorithms enable higher-quality safety studies and better-informed decisions about the risks and benefits of treatment.
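The effect of misclassification on a rare safety outcome can be shown with simple arithmetic. The sketch below computes how many patients an algorithm would flag, given its sensitivity and specificity; the function name and all numbers are hypothetical and serve only to illustrate the direction of the error.

```python
def observed_case_count(n_patients, true_prevalence, sensitivity, specificity):
    """Expected number of patients flagged by an imperfect algorithm."""
    true_cases = n_patients * true_prevalence
    non_cases = n_patients - true_cases
    detected = true_cases * sensitivity           # true cases the algorithm finds
    false_alarms = non_cases * (1 - specificity)  # non-cases flagged in error
    return detected + false_alarms


# Hypothetical cohort: 100,000 patients, 100 of whom truly have the outcome
n, prevalence = 100_000, 0.001

# Algorithm A: sensitive, but specificity slightly below 100%
flagged_a = observed_case_count(n, prevalence, sensitivity=0.90, specificity=0.999)

# Algorithm B: perfectly specific, but misses half of the true cases
flagged_b = observed_case_count(n, prevalence, sensitivity=0.50, specificity=1.0)

print(f"true cases: {n * prevalence:.0f}")
print(f"algorithm A flags about {flagged_a:.0f}")
print(f"algorithm B flags about {flagged_b:.0f}")
```

With these assumed values, algorithm A flags roughly 190 patients against 100 true cases, nearly doubling the apparent risk, while algorithm B flags 50 and halves it. Because the outcome is rare, even a specificity of 99.9% produces about as many false positives as true cases.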
Database studies have been serving the needs of the healthcare community for decades, and we are constantly striving to develop new methods that enhance the quality and broaden the application of database research. Machine learning is a method with many applications that are only beginning to be realized. In addition to identifying cases more accurately, these methods can also identify them sooner, sometimes even before a diagnosis is made. The potential to identify patients of interest in a more timely way can help speed the delivery of healthcare and improve patient outcomes. We are excited to participate in these developments and the promise they bring to the role of real-world data in improving the quality of our healthcare.