Improving data quality in unstructured documents is a long-standing challenge. Unstructured data, especially text, lacks explicitly defined semantics, which makes it difficult to process effectively and to ensure its quality. We propose leveraging information extraction algorithms to design, apply, and explain data cleaning processes for documents. Specifically, for a simple document update model, we identify and verify a set of sufficient conditions under which rule-based extraction programs qualify for inclusion in our document cleaning framework. Experiments on medical records show that our approach provides an effective framework for identifying and correcting data quality problems, which points to its practical value in real-world applications.