Speech-to-text models tend to be trained and evaluated against a single target accent. This is especially true for English, for which native speakers from the United States have become the main benchmark. In this work, we show how two simple methods, pre-trained embeddings and auxiliary classification losses, can improve the performance of ASR systems. We seek improvements that are as universal as possible, and therefore explore their impact on several model architectures and several languages.