Mitigating dataset harms requires stewardship: Lessons from 1000 papers

Concerns about privacy, bias, and harmful applications have shone a light on the ethics of machine learning datasets, even leading to the retraction of prominent datasets including DukeMTMC, MS-Celeb-1M, TinyImages, and VGGFace2. In response, the machine learning community has called for higher ethical standards, transparency efforts, and technical fixes in the dataset creation process. The premise of our work is that these efforts can be more effective if informed by an understanding of how datasets are used in practice in the research community. We study three influential face and person recognition datasets, namely DukeMTMC, MS-Celeb-1M, and Labeled Faces in the Wild (LFW), by analyzing nearly 1000 papers that cite them. We find that the creation of derivative datasets and models, broader technological and social change, the lack of clarity of licenses, and dataset management practices can introduce a wide range of ethical concerns. We conclude by suggesting a distributed approach that can mitigate these harms, making recommendations to dataset creators, conference program committees, dataset users, and the broader research community.


