Awareness of bias in ASR datasets and models has increased notably in recent years. Even for English, despite the vast amount of available training data, systems perform worse for non-native speakers. In this work, we improve an accent-conversion model (ACM) that transforms native US-English speech into accented pronunciation. We incorporate phonetic knowledge into ACM training to provide accurate feedback on how well certain pronunciation patterns are recovered in the synthesized waveform. Furthermore, we investigate the feasibility of learned accent representations as an alternative to static embeddings. The generated data was then used to train two state-of-the-art ASR systems. We evaluated our approach on native and non-native English datasets and found that synthetically accented data helped the ASR systems better recognize speech from seen accents. This improvement did not carry over to unseen accents, and it was not observed for a model that had been pre-trained exclusively on native speech.
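To illustrate the distinction between static embeddings and learned accent representations mentioned above, here is a minimal sketch; the accent labels, dimensions, and function names are hypothetical and not taken from the paper:

```python
import random

# Hypothetical set of accent labels (illustrative only).
accents = ["hindi", "korean", "vietnamese"]

def static_embedding(accent: str) -> list[float]:
    # Static representation: a fixed one-hot vector per accent,
    # never updated during training.
    return [1.0 if a == accent else 0.0 for a in accents]

# Learned representation: a trainable table of dense vectors.
# In a real model these rows would be updated by gradient descent;
# here they are just randomly initialized for illustration.
random.seed(0)
dim = 4
learned_table = {
    a: [random.uniform(-0.1, 0.1) for _ in range(dim)] for a in accents
}

def learned_embedding(accent: str) -> list[float]:
    # Look up the current (trainable) dense vector for this accent.
    return learned_table[accent]

print(static_embedding("korean"))       # one-hot over the accent set
print(len(learned_embedding("korean")))  # dense vector of size dim
```

The practical difference is that a learned table lets the model place acoustically similar accents close together in embedding space, whereas one-hot vectors treat all accents as equally distinct.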