We live in a world where 60% of the population can speak two or more languages fluently. Members of these communities constantly switch between languages during conversation. As automatic speech recognition (ASR) systems are deployed in the real world, there is a need for practical systems that can handle multiple languages both within and across utterances. In this paper, we present an end-to-end ASR system using a transformer-transducer model architecture for code-switched speech recognition. We propose three modifications over the vanilla model to handle various aspects of code-switching. First, we introduce two auxiliary loss functions to handle the low-resource nature of code-switching. Second, we propose a novel mask-based training strategy that uses language ID information to improve label encoder training for intra-sentential code-switching. Finally, we propose a multi-label/multi-audio encoder structure to leverage vast monolingual speech corpora for code-switching. We demonstrate the efficacy of our proposed approaches on the SEAME dataset, a public Mandarin-English code-switching corpus, achieving mixed error rates of 18.5% and 26.3% on the test_man and test_sge sets, respectively.
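The mask-based training idea for the label encoder can be sketched as follows. This is a minimal illustration under assumptions, not the paper's exact method: the function name `mask_at_code_switch`, the `<mask>` token, and the masking probability `p` are all hypothetical, and the point where language IDs change is taken as the code-switch boundary.

```python
import numpy as np

def mask_at_code_switch(tokens, lang_ids, mask_token="<mask>", p=1.0, seed=0):
    """Illustrative sketch: replace label tokens at intra-sentential
    code-switch points (positions where the language ID differs from the
    previous token) with a mask token, with probability p, so the label
    encoder is discouraged from relying on cross-language label context."""
    rng = np.random.default_rng(seed)
    out = list(tokens)
    for i in range(1, len(tokens)):
        if lang_ids[i] != lang_ids[i - 1] and rng.random() < p:
            out[i] = mask_token
    return out

# Example on a Mandarin-English code-switched label sequence:
# tokens switch zh -> en at "eat" and en -> zh at "了".
masked = mask_at_code_switch(
    ["我", "想", "eat", "apple", "了"],
    ["zh", "zh", "en", "en", "zh"],
)
# With p=1.0, both switch-point tokens are masked:
# ["我", "想", "<mask>", "apple", "<mask>"]
```

In an actual transducer setup, the masking would be applied to label embeddings during label-encoder training rather than to raw token strings; this string-level version only illustrates where the masks land.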