DataPerf: Benchmarks for Data-Centric AI Development

Mark Mazumder, Colby Banbury, Xiaozhe Yao, Bojan Karlaš, William Gaviria Rojas, Sudnya Diamos, Greg Diamos, Lynn He, Douwe Kiela, David Jurado, David Kanter, Rafael Mosquera, Juan Ciro, Lora Aroyo, Bilge Acun, Sabri Eyuboglu, Amirata Ghorbani, Emmett Goodman, Tariq Kane, Christine R. Kirkpatrick, Tzu-Sheng Kuo, Jonas Mueller, Tristan Thrush, Joaquin Vanschoren, Margaret Warren, Adina Williams, Serena Yeung, Newsha Ardalani, Praveen Paritosh, Ce Zhang, James Zou, Carole-Jean Wu, Cody Coleman, Andrew Ng, Peter Mattson, Vijay Janapa Reddi

Machine learning (ML) research has generally focused on models, while the most prominent datasets have been employed for everyday ML tasks without regard for the breadth, difficulty, and faithfulness of these datasets to the underlying problem. Neglecting the fundamental importance of datasets has caused major problems involving data cascades in real-world applications and saturation of dataset-driven criteria for model quality, hindering research growth. To solve this problem, we present DataPerf, a benchmark package for evaluating ML datasets and dataset-working algorithms. We intend it to enable the "data ratchet," in which training sets will aid in evaluating test sets on the same problems, and vice versa. Such a feedback-driven strategy will generate a virtuous loop that will accelerate development of data-centric AI. The MLCommons Association will maintain DataPerf.
