High-quality data is critical for training performant Machine Learning (ML) models, highlighting the importance of Data Quality Management (DQM). Existing DQM schemes often cannot satisfactorily improve ML performance because, by design, they are oblivious to downstream ML tasks. Moreover, they cannot handle various data quality issues (especially those caused by adversarial attacks) and apply only to certain types of ML models. Recently, data valuation approaches (e.g., based on the Shapley value) have been leveraged to perform DQM; yet, empirical studies have observed that their performance varies considerably depending on the underlying data and training process. In this paper, we propose a task-driven, multi-purpose, model-agnostic DQM framework, DataSifter, which is optimized towards a given downstream ML task, capable of effectively removing data points with various defects, and applicable to diverse models. Specifically, we formulate DQM as an optimization problem and devise a scalable algorithm to solve it. Furthermore, we propose a theoretical framework for comparing the worst-case performance of different DQM strategies. Remarkably, our results show that the popular strategy based on the Shapley value may end up choosing the worst data subset in certain practical scenarios. Our evaluation shows that DataSifter matches and most often significantly outperforms the state of the art across a wide range of DQM tasks, including backdoor, poison, and noisy/mislabeled data detection, data summarization, and data debiasing.