Concerns about privacy, bias, and harmful applications have shone a light on the ethics of machine learning datasets, even leading to the retraction of prominent datasets including DukeMTMC, MS-Celeb-1M, TinyImages, and VGGFace2. In response, the machine learning community has called for higher ethical standards, transparency efforts, and technical fixes in the dataset creation process. The premise of our work is that these efforts can be more effective if informed by an understanding of how datasets are used in practice in the research community. We study three influential face and person recognition datasets - DukeMTMC, MS-Celeb-1M, and Labeled Faces in the Wild (LFW) - by analyzing nearly 1000 papers that cite them. We found that the creation of derivative datasets and models, broader technological and social change, the lack of clarity of licenses, and dataset management practices can introduce a wide range of ethical concerns. We conclude by suggesting a distributed approach that can mitigate these harms, making recommendations to dataset creators, conference program committees, dataset users, and the broader research community.