Taking Data to the Cleaners
Research data can come in all degrees of “cleanliness.” Clean data are free of errors and of out-of-range, missing, or misplaced values. It turns out that much can be done to improve the “cleanliness” of data, and our Resource Operations team is at the forefront of such efforts.
It begins with knowing, at the start of a research project, which statistical and software data-management tools will speed the process and increase the chances of producing a reliable, reproducible data set. From the outset, our data managers create a solid data management plan (DMP) outlining the steps they will take during each stage of the project. With a DMP in hand, all data management responsibilities and deliverables can be reviewed, revised if needed, and agreed upon by all stakeholders. Executing the plan consistently throughout the project lifespan promotes quality and timely data analysis.
During project start-up, typical data entry errors such as misspellings, range errors, and invalid data placement are addressed by applying standardized coding schemes and validated range checks to each data item. Additional inter- and intra-form checks, as well as careful directions and help text placed on data entry screens, help create tight data entry routines.
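As a minimal sketch of what a coding-scheme check and a range check on each data item might look like, consider the Python example below. The field names (sex, age_years, systolic_bp), the allowed codes, and the range limits are hypothetical, chosen only for illustration; a real project would take them from its DMP.

```python
from typing import List

# Hypothetical coding scheme and range limits (assumptions for illustration only).
SEX_CODES = {"M", "F", "U"}
RANGES = {"age_years": (18, 90), "systolic_bp": (60, 260)}


def check_record(record: dict) -> List[str]:
    """Return a list of edit-check failures for one data entry record."""
    problems = []

    # Coded item: the value must be one of the standardized codes.
    if record.get("sex") not in SEX_CODES:
        problems.append(f"sex: invalid code {record.get('sex')!r}")

    # Numeric items: the value must be present, numeric, and within range.
    for field, (low, high) in RANGES.items():
        value = record.get(field)
        if value is None:
            problems.append(f"{field}: missing")
        elif not isinstance(value, (int, float)):
            problems.append(f"{field}: not numeric ({value!r})")
        elif not low <= value <= high:
            problems.append(f"{field}: {value} outside {low}-{high}")

    return problems
```

A record that passes every check returns an empty list, so a data entry screen could accept it without raising a query; any other result lists the items that need attention.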
During data collection, standardized reports are run on a defined basis, identifying missing data, data statistics by element, and inconsistencies not noticeable during first-line edit checks. The information gathered from one report may not alert the data manager to any particular issue. Data managers, however, often become “data detectives” when they suspect potential site or system-wide underlying issues.
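A standardized report of this kind could be sketched roughly as follows. This is an illustrative example, not the team's actual reporting tool: the record layout (a list of dictionaries with a "site" key) and the function names are assumptions.

```python
from collections import defaultdict
from statistics import mean


def missing_report(records, fields):
    """Return {site: {field: fraction of records with the field missing}}."""
    missing = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for rec in records:
        site = rec["site"]
        totals[site] += 1
        for field in fields:
            if rec.get(field) is None:
                missing[site][field] += 1
    return {
        site: {field: missing[site][field] / totals[site] for field in fields}
        for site in totals
    }


def field_summary(records, field):
    """Return basic statistics for one data element, ignoring missing values."""
    values = [rec[field] for rec in records if rec.get(field) is not None]
    summary = {"n": len(values), "missing": len(records) - len(values)}
    if values:
        summary.update(min=min(values), max=max(values), mean=mean(values))
    return summary
```

Comparing the missing-data fractions across sites, or across successive runs of the same report, is one way a pattern can surface that no single report would reveal.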
For example, in a complex randomized drug trial, the Data Manager (DM) noticed a high rate of missing values in the baseline data. Investigation showed that the results were being reported in a format other than what had been agreed upon in the DMP. Sites were setting the data to “missing” and reporting the results in a “comments” section rather than in the data field itself. Data in a comments section do not undergo the same scrutiny (i.e., validation and range checks) as data in a data field. The error was caught early, the case report form was modified, and the reporting sites moved the actual data to the proper place. The result? Significantly cleaner data and valuable time savings.
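One way such a pattern might be flagged automatically is a cross-field check that looks for a missing result alongside a comment containing a number. The sketch below is hypothetical: the field names "result" and "comments" are assumptions, and the trial described above did not necessarily use a check like this.

```python
import re

# Matches an integer or decimal number anywhere in free text.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def value_hidden_in_comment(record, field="result", comment_field="comments"):
    """Return True if the data field is missing but the comment contains a number."""
    value = record.get(field)
    comment = record.get(comment_field) or ""
    return value is None and NUMBER.search(comment) is not None
```

Records flagged this way still need human review, since a number in a comment may be a date or a dose rather than the missing result.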
A well-crafted DMP can also play a valuable role during project close-out. At this point, the DMP ties all data management activities together and provides historical documentation of what data management tasks occurred, what queries were issued, and what data cleaning requests were made to sites. Such information is not only invaluable as part of the overall validation of the project results but can also be used by others pursuing similar types of research. By learning from prior projects, researchers can create even better DMPs, which can initiate a positive feedback loop of improvement and progress.