It is well known that many machine learning systems demonstrate bias towards specific groups of individuals. This problem has been studied extensively in the facial recognition area, but much less so in Automatic Speech Recognition (ASR). This paper presents initial speech recognition results on "Casual Conversations" -- a publicly released 846-hour corpus designed to help researchers evaluate their computer vision and audio models for accuracy across a diverse set of metadata, including age, gender, and skin tone. The entire corpus has been manually transcribed, allowing for detailed ASR evaluations across these metadata. Multiple ASR models are evaluated, including models trained on LibriSpeech, on 14,000 hours of transcribed social media videos, and on over 2 million hours of untranscribed social media videos. Significant differences in word error rate across gender and skin tone are observed at times for all models. We are releasing human transcripts from the Casual Conversations dataset to encourage the community to develop a variety of techniques to reduce these statistical biases.
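The per-group evaluation described above reduces to computing word error rate (WER) separately for each metadata category. As a minimal sketch (not the paper's actual evaluation code), the following computes corpus-level WER per group by pooling word-level edit distances over all utterances in that group; the `(group, reference, hypothesis)` tuple format is an illustrative assumption, not the dataset's actual schema:

```python
from collections import defaultdict

def word_edits(reference, hypothesis):
    """Levenshtein edit distance between the two word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    prev_row = list(range(len(hyp) + 1))
    for i, rw in enumerate(ref, 1):
        cur = [i]
        for j, hw in enumerate(hyp, 1):
            cur.append(min(prev_row[j] + 1,                 # deletion
                           cur[j - 1] + 1,                  # insertion
                           prev_row[j - 1] + (rw != hw)))   # substitution or match
        prev_row = cur
    return prev_row[-1]

def wer_by_group(utterances):
    """Corpus-level WER per group: total edits / total reference words.

    `utterances` is an iterable of (group, reference, hypothesis) tuples,
    e.g. group = a gender or skin-tone label (hypothetical field names).
    """
    edits, words = defaultdict(int), defaultdict(int)
    for group, ref, hyp in utterances:
        edits[group] += word_edits(ref, hyp)
        words[group] += len(ref.split())
    return {g: edits[g] / words[g] for g in edits}
```

Pooling edits and reference words before dividing (rather than averaging per-utterance WERs) matches the standard corpus-level WER definition, so groups with many short utterances are not over-weighted.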