Data Quality Dashboard
A Tool for Error Detection and Root Cause Analysis in Production Data Capture Systems
When attempting to test a large and complex data capture (or data classification) system running in production mode, it is difficult, costly and time-consuming to locate pockets of errors occurring in the production data and even more difficult to determine the cause(s) of these errors.
In our work with the U.S. Census Bureau on the 2010 Census, we developed a powerful yet easy-to-use software tool to perform this type of “needle in the haystack” analysis. We refer to this tool as the Data Quality Dashboard (DQD).
The DQD compares the outputs of a production data capture system to a “truth” data set corresponding to the data source from which the production data is drawn. This truth data set can either be derived by means of an independent data capture system such as Production Data Quality (PDQ) or provided a priori as part of a Digital Test Deck® (DTD).
Like any typical “dashboard,” the DQD first compares the aggregate production data error rate to a specified target error rate to verify compliance with requirements. The tool then highlights areas within the production data where errors are occurring at statistically significant rates, allowing one to drill down stepwise (even to the original data source) to determine the root cause of the errors. The tool can also display the processing history of errors to show how the production data capture system arrived at its answers. Once the root cause of an error is identified, it becomes much easier to resolve the problem and improve the system. Continued testing in this manner verifies that the improvement has been made successfully and enables continuous improvement (even for systems that are already meeting aggregate error rate requirements).
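As an illustration of the kind of check described above, the following is a minimal sketch and not the DQD's actual implementation. It assumes each record is a hypothetical (group, production value, truth value) triple, where a group might be a form field or a batch, and it uses a one-sided exact binomial test as one possible definition of "statistically significant." The function names, the `alpha` threshold and the record layout are all assumptions made for the example. The exact tail sum is only practical for modest group sizes; larger groups would call for a statistics library.

```python
from math import comb


def binomial_tail(n, k, p):
    """Return P(X >= k) for X ~ Binomial(n, p).

    This is the chance of seeing k or more errors in n records
    if the true error rate were exactly p.
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def flag_error_pockets(records, target_rate, alpha=0.01):
    """Compare production values to truth values and flag suspicious groups.

    records: iterable of (group, production_value, truth_value) triples
             (a hypothetical layout chosen for this sketch).
    Returns (aggregate_error_rate, flagged), where flagged is a list of
    (group, record_count, error_count, p_value) sorted by p_value.
    """
    counts = {}  # group -> (record_count, error_count)
    for group, produced, truth in records:
        n, k = counts.get(group, (0, 0))
        counts[group] = (n + 1, k + (produced != truth))

    total_n = sum(n for n, _ in counts.values())
    total_k = sum(k for _, k in counts.values())
    aggregate_rate = total_k / total_n if total_n else 0.0

    flagged = []
    for group, (n, k) in counts.items():
        # One-sided test: is this group's error count too high to be
        # explained by the target rate alone?
        p_value = binomial_tail(n, k, target_rate)
        if p_value < alpha:
            flagged.append((group, n, k, p_value))

    flagged.sort(key=lambda row: row[3])  # most significant first
    return aggregate_rate, flagged
```

A caller would first compare the returned aggregate rate to the target, then work through the flagged groups in order, which mirrors the stepwise drill-down described above.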
The DQD includes built-in filtering, sorting and ad hoc query features for advanced root cause analysis. Aggregated statistics from the tool can also be exported (without exposing the actual production data source) in a variety of file formats for more sophisticated analysis using popular off-the-shelf software packages. This makes the tool an ideal solution for analyzing data quality from sensitive data sources such as Census forms, electronic medical records and tax returns.
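The idea of exporting aggregates without exposing the underlying data can be sketched as follows. This is an assumption-laden illustration, not the DQD's export feature: the record layout, the function name `export_aggregates` and the choice of CSV are all hypothetical. The point is that only group labels and counts are written, never the captured or truth values themselves.

```python
import csv
import io


def export_aggregates(records, out):
    """Write per-group record and error counts to a CSV stream.

    records: iterable of (group, production_value, truth_value) triples
             (a hypothetical layout chosen for this sketch).
    out:     any writable text stream, such as an open file or StringIO.
    Only counts and rates are written; no field values leave this function.
    """
    counts = {}  # group -> (record_count, error_count)
    for group, produced, truth in records:
        n, k = counts.get(group, (0, 0))
        counts[group] = (n + 1, k + (produced != truth))

    writer = csv.writer(out)
    writer.writerow(["group", "records", "errors", "error_rate"])
    for group in sorted(counts):
        n, k = counts[group]
        writer.writerow([group, n, k, f"{k / n:.4f}"])


if __name__ == "__main__":
    sample = [
        ("name", "SMYTH", "SMITH"),  # captured value differs from truth
        ("name", "SMITH", "SMITH"),
        ("age", "42", "42"),
    ]
    buffer = io.StringIO()
    export_aggregates(sample, buffer)
    print(buffer.getvalue())
```

Note that group labels themselves are written out, so in a real setting they would need to be chosen so that they do not reveal sensitive content.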