As data-driven systems are increasingly deployed at scale, ethical concerns have arisen around unfair and discriminatory outcomes for historically marginalized groups that are underrepresented in training data. In response, work on AI fairness and inclusion has called for datasets that are representative of various demographic groups. In this paper, we contribute an analysis of the representativeness of age, gender, and race & ethnicity in accessibility datasets - datasets sourced from people with disabilities and older adults - which can potentially play an important role in mitigating bias in inclusive AI-infused applications. We examine the current state of representation within these datasets by reviewing the publicly available information of 190 datasets sourced from people with disabilities, which we call accessibility datasets. We find that accessibility datasets represent diverse ages but have gaps in gender and race representation. Additionally, we investigate how the sensitive and complex nature of demographic variables (e.g., gender, race & ethnicity) makes classification difficult and inconsistent, with the source of labeling often unknown. By reflecting on the current challenges and opportunities for the representation of disabled data contributors, we hope our effort expands the space of possibility for greater inclusion of marginalized communities in AI-infused systems.