Humans can recognize speech and, simultaneously, the particular accent in which it is spoken. However, present state-of-the-art ASR systems can rarely do that. In this paper, we propose a multilingual approach to recognizing both English speech and the accent that the speaker conveys, using the DNN-HMM framework. Specifically, we treat different accents of English as different languages. We then merge them together and train a multilingual ASR system. During decoding, we conduct two experiments. One is monolingual ASR-based decoding, with the accent information embedded at the phone level, realizing word-based accent recognition (AR); the other is multilingual ASR-based decoding, realizing an approximated utterance-based AR. Experimental results on an 8-accent English speech recognition task show that both methods yield WERs close to those of conventional ASR systems that completely ignore accent, as well as the desired AR accuracy. In addition, we conduct extensive analysis of the proposed method, covering transfer learning with out-of-domain data exploitation, cross-accent recognition confusion, and the characteristics of accented words.
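To make the phone-level accent embedding concrete, the following is a minimal Python sketch rather than the authors' implementation. It assumes that accent information is attached to each phone as a suffix (for example `AO_US`), that per-accent lexicons are merged into a single lexicon with one tagged pronunciation per accent, and that an utterance-level accent can be approximated by a majority vote over word-level accents. The suffix scheme, the function names, and the voting rule are all illustrative assumptions; the abstract does not specify them.

```python
from collections import Counter


def tag_lexicon(lexicon, accent):
    """Append an accent suffix to every phone (hypothetical tagging scheme)."""
    return {word: [f"{p}_{accent}" for p in phones]
            for word, phones in lexicon.items()}


def merge_lexicons(lexicons_by_accent):
    """Merge per-accent lexicons; each word keeps one tagged pronunciation per accent."""
    merged = {}
    for accent, lexicon in lexicons_by_accent.items():
        for word, phones in tag_lexicon(lexicon, accent).items():
            merged.setdefault(word, []).append(phones)
    return merged


def word_accent(tagged_phones):
    """Read a decoded word's accent off its phone tags (majority over phones)."""
    tags = [p.rsplit("_", 1)[1] for p in tagged_phones]
    return Counter(tags).most_common(1)[0][0]


def utterance_accent(decoded):
    """Approximate the utterance accent by majority vote over word accents.

    `decoded` is a list of (word, tagged_phones) pairs, standing in for the
    output of a decoder; the voting rule is an assumption of this sketch.
    """
    votes = Counter(word_accent(phones) for _, phones in decoded)
    return votes.most_common(1)[0][0]


# Toy data: the same base pronunciation shared by two hypothetical accents.
base = {"water": ["W", "AO", "T", "ER"], "hello": ["HH", "AH", "L", "OW"]}
merged = merge_lexicons({"US": base, "UK": base})

# A stand-in decoder output: two words decoded with US phones, one with UK phones.
decoded = [
    ("hello", merged["hello"][0]),
    ("water", merged["water"][0]),
    ("water", merged["water"][1]),
]
print([word_accent(phones) for _, phones in decoded])
print(utterance_accent(decoded))
```

Because the accent is carried by the phone labels themselves, a word-level accent decision falls out of ordinary decoding without a separate classifier, which is the appeal of the phone-level embedding described above.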