Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research

Benchmark datasets play a central role in the organization of machine learning research. They coordinate researchers around shared research problems and serve as a measure of progress towards shared goals. Despite the foundational role of benchmarking practices in this field, relatively little attention has been paid to the dynamics of benchmark dataset use and reuse, within or across machine learning subcommunities. In this paper, we dig into these dynamics. We study how dataset usage patterns differ across machine learning subcommunities and across time from 2015 to 2020. We find increasing concentration on fewer and fewer datasets within task communities, significant adoption of datasets from other tasks, and concentration across the field on datasets that have been introduced by researchers situated within a small number of elite institutions. Our results have implications for scientific evaluation, AI ethics, and equity/access within the field.
