Datasets are central to training machine learning (ML) models. The ML community has recently made significant improvements to data stewardship and documentation practices across the model development life cycle. However, the act of deprecating, or deleting, datasets has been largely overlooked, and there are currently no standardized approaches for structuring this stage of the dataset life cycle. In this paper, we study the practice of dataset deprecation in ML, identify several cases of datasets that continued to circulate despite having been deprecated, and describe the technical, legal, ethical, and organizational issues raised by their continued circulation. We then propose a Dataset Deprecation Framework that the ML community can adapt and implement; it includes considerations of risk, mitigation of impact, appeal mechanisms, timeline, post-deprecation protocols, and publication checks. Finally, we propose creating a centralized, sustainable repository system for archiving datasets, tracking dataset modifications or deprecations, and facilitating practices of care and stewardship that can be integrated into research and publication processes.