Human-annotated data plays a crucial role in machine learning (ML) research and development. However, the ethical considerations around the processes and decisions that go into dataset annotation have not received nearly enough attention. In this paper, we survey an array of literature that provides insights into ethical considerations around crowdsourced dataset annotation. We synthesize these insights and lay out the challenges in this space along two layers: (1) who the annotator is, and how the annotators' lived experiences can impact their annotations, and (2) the relationship between the annotators and the crowdsourcing platforms, and what that relationship affords them. Finally, we introduce a novel framework, CrowdWorkSheets, for dataset developers to facilitate transparent documentation of key decision points at various stages of the data annotation pipeline: task formulation, selection of annotators, platform and infrastructure choices, dataset analysis and evaluation, and dataset release and maintenance.