Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective

Data-centric AI is at the center of a fundamental shift in software engineering in which machine learning becomes the new software, powered by big data and computing infrastructure. Software engineering therefore needs to be rethought so that data becomes a first-class citizen on par with code. One striking observation is that a significant portion of the machine learning process is spent on data preparation. Without good data, even the best machine learning algorithms cannot perform well. As a result, data-centric AI practices are now becoming mainstream. Unfortunately, many real-world datasets are small, dirty, biased, and even poisoned. In this survey, we study the research landscape for data collection and data quality, primarily for deep learning applications. Data collection is important because recent deep learning approaches have less need for feature engineering but a greater need for large amounts of data. For data quality, we study data validation, cleaning, and integration techniques. Even if the data cannot be fully cleaned, we can still cope with imperfect data by using robust model training techniques. In addition, while bias and fairness have been less studied in traditional data management research, they have become essential topics in modern machine learning applications. We thus study fairness measures and unfairness mitigation techniques that can be applied before, during, or after model training. We believe that the data management community is well poised to solve these problems.
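
As a concrete illustration of what a fairness measure can look like, the following minimal sketch computes a demographic parity gap, that is, the difference in positive-prediction rates across groups. This is one commonly used measure and is offered only as an example; the abstract does not say which measures the survey covers. The function name `demographic_parity_gap` and the toy data are hypothetical.

```python
from collections import defaultdict


def demographic_parity_gap(predictions, groups):
    """Return (gap, rates): the largest difference in positive-prediction
    rate between any two groups, and the per-group rates.

    Illustrative sketch only; the name and interface are assumptions.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    # Positive-prediction rate per group.
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates


# Toy, made-up predictions for two hypothetical groups "a" and "b".
preds = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
gap, rates = demographic_parity_gap(preds, groups)
print(gap, rates)
```

A gap of zero would mean every group receives positive predictions at the same rate. A measure of this kind can be evaluated on the labels before training, on model outputs during training, or on final predictions afterward.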
