Research in speech technologies has recently advanced considerably thanks to newly created public domain corpora containing thousands of hours of recordings. These large amounts of data are very helpful for training new, complex models based on deep learning. However, a lack of dialectal diversity in a corpus is known to cause performance biases in speech systems, particularly for underrepresented dialects. In this work, we propose to evaluate a state-of-the-art deep learning-based automatic speech recognition (ASR) model using unseen data from a corpus with a wide variety of labeled English accents from different countries around the world. The model has been trained with 44.5K hours of English speech from an open access corpus called Multilingual LibriSpeech, showing remarkable results on popular benchmarks. We test the accuracy of this ASR model against samples extracted from another public corpus that is continuously growing, the Common Voice dataset. We then present graphically the accuracy, in terms of Word Error Rate, for each of the English accents included, showing that there is indeed an accuracy bias with respect to accentual variety, favoring the accents most prevalent in the training corpus.
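For reference, the Word Error Rate used throughout the evaluation is the standard edit-distance-based metric, computed per utterance (or per accent group) as

\[
\mathrm{WER} = \frac{S + D + I}{N},
\]

where \(S\), \(D\), and \(I\) are the numbers of word substitutions, deletions, and insertions needed to align the ASR hypothesis with the reference transcript, and \(N\) is the number of words in the reference. Lower values indicate higher recognition accuracy.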