Despite rapid progress in automatic speech recognition (ASR) research, recognizing multilingual speech with a unified ASR system remains highly challenging. Previous work on multilingual speech recognition has mainly followed two directions: recognizing monolingual speech in multiple languages, or recognizing code-switched speech, in which different languages are used interchangeably within a single utterance. A pragmatic multilingual recognizer, however, is expected to handle both cases. In this work, a novel language-aware encoder (LAE) architecture is proposed to handle both situations by disentangling language-specific information and generating frame-level language-aware representations during encoding. In the LAE, the primary encoding is implemented by a shared block, while language-specific blocks extract a separate representation for each language. To learn language-specific information discriminatively, a language-aware training method is proposed to optimize the language-specific blocks in the LAE. Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating between languages at the frame level and shows superior performance on both monolingual and multilingual ASR tasks. With either a real-recorded or a simulated code-switched dataset, the proposed LAE achieves statistically significant improvements on both CTC and neural transducer systems. Code is released
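The shared-block and language-specific-block layout described above can be illustrated with a small toy sketch. It is not the released implementation: the class name `LanguageAwareEncoderSketch`, the single `tanh` layer standing in for each block, the random weights, and the element-wise sum used to fuse the per-language outputs are all assumptions made for illustration, since the abstract does not state how the language-specific representations are combined or how the blocks are built.

```python
import math
import random


def init_weights(rng, d_in, d_out):
    """Return a random (d_in x d_out) weight matrix as nested lists."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(d_out)] for _ in range(d_in)]


def block(frames, weights):
    """Toy stand-in for an encoder block: per-frame linear map followed by tanh."""
    return [
        [math.tanh(sum(x * w for x, w in zip(frame, column)))
         for column in zip(*weights)]
        for frame in frames
    ]


class LanguageAwareEncoderSketch:
    """Hypothetical layout: one shared block feeding one block per language."""

    def __init__(self, d_in, d_model, languages, seed=0):
        rng = random.Random(seed)
        self.shared = init_weights(rng, d_in, d_model)
        self.specific = {
            lang: init_weights(rng, d_model, d_model) for lang in languages
        }

    def encode(self, frames):
        # Primary encoding by the shared block.
        shared_out = block(frames, self.shared)
        # Each language-specific block sees the same shared output and
        # produces its own frame-level representation.
        per_language = {
            lang: block(shared_out, weights)
            for lang, weights in self.specific.items()
        }
        # Assumption: fuse the per-language streams with an element-wise sum.
        fused = [
            [sum(values) for values in zip(*frame_group)]
            for frame_group in zip(*per_language.values())
        ]
        return fused, per_language


# Usage: two frames of 4-dimensional features, two languages.
encoder = LanguageAwareEncoderSketch(d_in=4, d_model=3, languages=["zh", "en"])
features = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, -0.2, 0.0]]
fused, per_language = encoder.encode(features)
```

In this sketch `per_language` keeps one frame-level stream per language, which is where a language-aware training objective could be attached, while `fused` is what a CTC or transducer decoder would consume.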