A Framework for Deprecating Datasets: Standardizing Documentation, Identification, and Communication

Datasets are central to training machine learning (ML) models. The ML community has recently made significant improvements to data stewardship and documentation practices across the model development life cycle. However, the act of deprecating, or deleting, datasets has been largely overlooked, and there are currently no standardized approaches for structuring this stage of the dataset life cycle. In this paper, we study the practice of dataset deprecation in ML, identify several cases of datasets that continued to circulate despite having been deprecated, and describe the different technical, legal, ethical, and organizational issues raised by such continuations. We then propose a Dataset Deprecation Framework that includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocols, and publication checks that can be adapted and implemented by the ML community. Finally, we propose creating a centralized, sustainable repository system for archiving datasets, tracking dataset modifications or deprecations, and facilitating practices of care and stewardship that can be integrated into research and publication processes.
