Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective

Software 2.0 is a fundamental shift in software engineering in which machine learning becomes the new software, powered by big data and computing infrastructure. As a result, software engineering needs to be re-thought so that data becomes a first-class citizen on par with code. One striking observation is that 80-90% of the machine learning process is spent on data preparation. Without good data, even the best machine learning algorithms cannot perform well. As a result, data-centric AI practices are now becoming mainstream. Unfortunately, many datasets in the real world are small, dirty, biased, and even poisoned. In this survey, we study the research landscape for data collection and data quality, primarily for deep learning applications. Data collection is important because recent deep learning approaches need less feature engineering but larger amounts of data. For data quality, we study data validation and data cleaning techniques. Even if the data cannot be fully cleaned, we can still cope with imperfect data during model training by using robust model training techniques. In addition, while bias and fairness have been less studied in traditional data management research, these issues have become essential topics in modern machine learning applications. We thus study fairness measures and unfairness mitigation techniques that can be applied before, during, or after model training. We believe that the data management community is well poised to solve problems in these directions.
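The abstract mentions fairness measures without naming a specific one. As an illustration only, the sketch below computes demographic parity difference, one commonly used group-fairness measure: the gap in positive-prediction rate between groups. The function name, the toy predictions, and the group labels are all hypothetical and are not taken from the survey.

```python
from collections import defaultdict


def demographic_parity_difference(predictions, groups):
    """Return (gap, rates): the largest difference in positive-prediction
    rate between any two groups, plus the per-group rates.

    Hypothetical helper for illustration; not an API from the survey.
    """
    positives = defaultdict(int)
    totals = defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        positives[group] += int(pred == 1)
    # Positive-prediction rate per group.
    rates = {g: positives[g] / totals[g] for g in totals}
    return max(rates.values()) - min(rates.values()), rates


# Made-up binary predictions and group memberships (assumed data).
preds = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

gap, rates = demographic_parity_difference(preds, groups)
print(rates, gap)
```

A gap of zero would mean every group receives positive predictions at the same rate. Such a measure can be computed on labels before training or on model outputs after training, which matches the abstract's point that fairness can be addressed at several stages.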

