The opacity of machine learning data is a significant threat to ethical data work and intelligible systems. Previous research has addressed this issue by proposing standardized checklists to document datasets. This paper expands that field of inquiry by proposing a shift of perspective: from documenting datasets toward documenting data production. We draw on participatory design and collaborate with data workers at two companies located in Bulgaria and Argentina, where the collection and annotation of data for machine learning are outsourced. Our investigation comprises 2.5 years of research, including 33 semi-structured interviews, five co-design workshops, the development of prototypes, and several feedback instances with participants. We identify key challenges and requirements related to the integration of documentation practices in real-world data production scenarios. Our findings comprise important design considerations and highlight the value of designing data documentation based on the needs of data workers. We argue that a view of documentation as a boundary object, i.e., an object that can be used differently across organizations and teams but holds enough immutable content to maintain integrity, can be useful when designing documentation to retrieve heterogeneous, often distributed, contexts of data production.