Software 2.0 is a fundamental shift in software engineering in which machine learning becomes the new software, powered by big data and computing infrastructure. As a result, software engineering needs to be rethought so that data becomes a first-class citizen on par with code. One striking observation is that 80-90% of the time in the machine learning process is spent on data preparation. Without good data, even the best machine learning algorithms cannot perform well. Data-centric AI practices are therefore becoming mainstream. Unfortunately, many datasets in the real world are small, dirty, biased, and even poisoned. In this survey, we study the research landscape for data collection and data quality, primarily for deep learning applications. Data collection is important because recent deep learning approaches have less need for feature engineering but a greater need for large amounts of data. For data quality, we study data validation and data cleaning techniques. Even if the data cannot be fully cleaned, we can still cope with imperfect data during model training by using robust model training techniques. In addition, while bias and fairness have been less studied in traditional data management research, these issues have become essential topics in modern machine learning applications. We thus study fairness measures and unfairness mitigation techniques that can be applied before, during, or after model training. We believe that the data management community is well poised to solve problems in these directions.