Title: Right the docs: Characterising voice dataset documentation practices used in machine learning

Voice-enabled technology is quickly becoming ubiquitous, and is constituted from machine learning (ML)-enabled components such as speech recognition and voice activity detection. However, these systems don't yet work well for everyone. They exhibit bias - the systematic and unfair discrimination against individuals or cohorts of individuals in favour of others (Friedman & Nissenbaum, 1996) - across axes such as age, gender and accent. ML is reliant on large datasets for training. Dataset documentation is designed to give ML Practitioners (MLPs) a better understanding of a dataset's characteristics. However, there is a lack of empirical research on voice dataset documentation specifically. Additionally, while MLPs are frequent participants in fairness research, little work focuses on those who work with voice data. Our work makes an empirical contribution to this gap. Here, we combine two methods to form an exploratory study. First, we undertake 13 semi-structured interviews, exploring multiple perspectives of voice dataset documentation practice. Using open and axial coding methods, we explore MLPs' practices through the lenses of roles and trade-offs. Drawing from this work, we then purposively sample voice dataset documents (VDDs) for 9 voice datasets. Our findings then triangulate these two methods, using the lenses of MLP roles and trade-offs. We find that current VDD practices are inchoate, inadequate and incommensurate. The characteristics of voice datasets are codified in fragmented, disjoint ways that often do not meet the needs of MLPs. Moreover, they cannot be readily compared, presenting a barrier to practitioners' bias reduction efforts. We then discuss the implications of these findings for bias practices in voice data and speech technologies. We conclude by setting out a program of future work to address these findings, that is, how we may "right the docs".

