Data Representativeness in Accessibility Datasets: A Meta-Analysis

As data-driven systems are increasingly deployed at scale, ethical concerns have arisen around unfair and discriminatory outcomes for historically marginalized groups that are underrepresented in training data. In response, work on AI fairness and inclusion has called for datasets that are representative of various demographic groups. In this paper, we contribute an analysis of the representativeness of age, gender, and race & ethnicity in accessibility datasets (datasets sourced from people with disabilities and older adults) that can potentially play an important role in mitigating bias for inclusive AI-infused applications. We examine the current state of representation within datasets sourced from people with disabilities by reviewing publicly available information on 190 such datasets, which we call accessibility datasets. We find that accessibility datasets represent diverse ages, but have gender and race representation gaps. Additionally, we investigate how the sensitive and complex nature of demographic variables (e.g., gender, race & ethnicity) makes classification difficult and inconsistent, with the source of labeling often unknown. By reflecting on the current challenges and opportunities for representation of disabled data contributors, we hope our effort expands the space of possibility for greater inclusion of marginalized communities in AI-infused systems.


