Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata

Data is central to the development and evaluation of machine learning (ML) models. However, the use of problematic or inappropriate datasets can result in harms when the resulting models are deployed. To encourage responsible AI practice through more deliberate reflection on datasets and transparency around the processes by which they are created, researchers and practitioners have begun to advocate for increased data documentation and have proposed several data documentation frameworks. However, there is little research on whether these data documentation frameworks meet the needs of ML practitioners, who both create and consume datasets. To address this gap, we set out to understand ML practitioners' data documentation perceptions, needs, challenges, and desiderata, with the goal of deriving design requirements that can inform future data documentation frameworks. We conducted a series of semi-structured interviews with 14 ML practitioners at a single large, international technology company. During the interviews, we asked them to answer a list of questions taken from datasheets for datasets (Gebru, 2021). Our findings show that current approaches to data documentation are largely ad hoc and myopic in nature. Participants expressed needs for data documentation frameworks to be adaptable to their contexts, integrated into their existing tools and workflows, and automated wherever possible. Although data documentation frameworks are often motivated from the perspective of responsible AI, participants did not make the connection between the questions that they were asked to answer and their responsible AI implications. In addition, participants often had difficulties prioritizing the needs of dataset consumers and providing information that someone unfamiliar with their datasets might need to know. Based on these findings, we derive seven design requirements for future data documentation frameworks.
