While automatic speech recognition performance has improved over the last several years, machines continue to perform significantly worse on accented speech than humans do. In addition, the most significant improvements on accented speech arise primarily from overwhelming the problem with hundreds or even thousands of hours of data; humans typically require far less data to adapt to a new accent. This paper explores methods inspired by human perception to evaluate possible performance improvements for recognition of accented speech, with a specific focus on recognizing speech with a novel accent relative to that of the training data. Our experiments are run on small datasets that are readily accessible to the research community. We explore four methodologies: pre-exposure to multiple accents, grapheme- and phoneme-based pronunciations, dropout (to improve generalization to a novel accent), and the identification of the layers in the neural network that can specifically be associated with accent modeling. Our results indicate that methods based on human perception are promising for reducing WER and for understanding how accented speech is modeled in neural networks for novel accents.
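One of the four methodologies above, dropout, is a standard regularization technique: during training, activations are randomly zeroed so the network cannot over-rely on any single unit, which can improve generalization to unseen conditions such as a novel accent. The sketch below is a minimal, illustrative implementation of inverted dropout in NumPy (the function name, rate, and shapes are assumptions for illustration, not the paper's actual configuration):

```python
import numpy as np

def dropout(x, rate=0.5, train=True, rng=None):
    """Inverted dropout: zero each activation with probability `rate`
    during training, and rescale survivors by 1/(1 - rate) so the
    expected activation magnitude is unchanged at inference time."""
    if not train or rate == 0.0:
        return x  # dropout is disabled at inference
    rng = rng or np.random.default_rng(0)
    mask = rng.random(x.shape) >= rate  # True = keep the unit
    return x * mask / (1.0 - rate)

# Example: a 4x8 block of activations with 25% of units dropped.
acts = np.ones((4, 8))
out = dropout(acts, rate=0.25)
```

Surviving units take the value 1/(1 - 0.25) ≈ 1.333 and dropped units become 0, so the expected value of each activation remains 1; at inference (`train=False`) the input passes through unchanged.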