Popular ASR benchmarks such as Librispeech and Switchboard are limited in the diversity of settings and speakers they represent. We introduce a set of benchmarks matching real-life conditions, aimed at spotting possible biases and weaknesses in models. We find that although recent models do not appear to exhibit a gender bias, they usually show significant performance discrepancies across accents, and even larger ones depending on the socio-economic status of the speakers. Finally, all tested models show a strong performance drop when evaluated on conversational speech, and in this context even a language model trained on a dataset as large as Common Crawl does not seem to yield a significant improvement, which underlines the importance of developing conversational language models.