The bi-encoder structure has been intensively investigated for code-switching (CS) automatic speech recognition (ASR). However, most existing methods require that the two monolingual ASR models (MAMs) share the same structure, and they use only the encoders of the MAMs. As a result, pre-trained MAMs cannot be promptly and fully exploited for CS ASR. In this paper, we propose a monolingual recognizer fusion method for CS ASR. It consists of two stages: the speech awareness (SA) stage and the language fusion (LF) stage. In the SA stage, acoustic features are mapped to two language-specific predictions by two independent MAMs. To keep each MAM focused on its own language, we further extend the language-aware training strategy for the MAMs. In the LF stage, the BELM fuses the two language-specific predictions to obtain the final prediction. Moreover, we propose a text simulation strategy that simplifies the training of the BELM and reduces its reliance on CS data. Experiments on a Mandarin-English corpus demonstrate the effectiveness of the proposed method. The mix error rate on the test set is significantly reduced when open-source pre-trained MAMs are used.
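The two-stage flow described above can be sketched minimally as follows. This is an illustrative toy, not the authors' implementation: the functions `recognize_zh`, `recognize_en`, and `belm_fuse` are hypothetical stand-ins for the two MAMs (SA stage) and the BELM (LF stage), and the "acoustic features" are toy integers rather than real speech frames.

```python
# Hypothetical sketch of the monolingual-recognizer fusion pipeline.
# All names below are illustrative assumptions, not the paper's actual API.

def recognize_zh(features):
    """Stand-in for the Mandarin MAM: maps acoustic features to a
    language-specific prediction (SA stage). Toy rule: even-valued
    frames are treated as Mandarin tokens."""
    return ["<zh>" if f % 2 == 0 else "<noise>" for f in features]

def recognize_en(features):
    """Stand-in for the English MAM: odd-valued frames are treated
    as English tokens."""
    return ["<en>" if f % 2 == 1 else "<noise>" for f in features]

def belm_fuse(pred_zh, pred_en):
    """Stand-in for the BELM (LF stage): merges the two language-specific
    predictions into one code-switched hypothesis, keeping whichever
    recognizer produced a real token at each position."""
    return [z if z != "<noise>" else e for z, e in zip(pred_zh, pred_en)]

features = [2, 3, 4, 7]  # toy acoustic frames
hypothesis = belm_fuse(recognize_zh(features), recognize_en(features))
print(hypothesis)  # ['<zh>', '<en>', '<zh>', '<en>']
```

The point of the sketch is the decoupling: each recognizer sees only the acoustic features and commits to its own language, so the two models need not share a structure; the fusion model alone resolves the code-switched output.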