Mandarin-English code-switching (CS) is common among East and Southeast Asian speakers. However, intra-sentence switching between two very different languages makes CS speech recognition challenging. Meanwhile, recent non-autoregressive (NAR) ASR models remove the need for the left-to-right beam decoding used in autoregressive (AR) models and have achieved outstanding performance with fast inference. In this paper, we therefore use the Mask-CTC NAR ASR framework to tackle CS speech recognition. We propose changing the Mandarin output target of the encoder to Pinyin for faster encoder training, and we introduce a Pinyin-to-Mandarin decoder to learn contextualized information. Moreover, we propose word embedding label smoothing to regularize the decoder with contextualized information, and projection matrix regularization to bridge the gap between the encoder and decoder. We evaluate the proposed methods on the SEAME corpus and achieve exciting results.
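The abstract names word embedding label smoothing but does not give its formula. The following is a minimal sketch of one plausible reading, not the authors' implementation: the smoothing mass `epsilon` is spread over non-target tokens in proportion to a softmax of their cosine similarity to the target token's embedding, instead of uniformly. The function names, the `epsilon` and `temperature` parameters, and the toy embeddings are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def embedding_smoothed_target(target, embeddings, epsilon=0.1, temperature=1.0):
    """Build a soft target distribution over the vocabulary.

    Assumed scheme (hypothetical, for illustration only):
    - the target token keeps probability 1 - epsilon;
    - the remaining epsilon is shared among the other tokens, weighted by
      exp(cosine similarity to the target embedding / temperature).
    Requires a vocabulary of at least two tokens.
    """
    sims = [cosine(embeddings[target], e) for e in embeddings]
    weights = [
        0.0 if i == target else math.exp(s / temperature)
        for i, s in enumerate(sims)
    ]
    total = sum(weights)
    return [
        (1.0 - epsilon) if i == target else epsilon * w / total
        for i, w in enumerate(weights)
    ]


def soft_cross_entropy(log_probs, soft_target):
    """Cross-entropy between a soft target and the model's log-probabilities."""
    return -sum(t * lp for t, lp in zip(soft_target, log_probs))


# Toy vocabulary of three tokens; token 1 is close to token 0, token 2 is not.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
soft = embedding_smoothed_target(0, toy_embeddings, epsilon=0.1)
uniform_log_probs = [math.log(1.0 / 3.0)] * 3
loss = soft_cross_entropy(uniform_log_probs, soft)
```

In this sketch, a token whose embedding lies near the target receives more of the smoothing mass than an unrelated token, so the decoder is penalized less for confusing contextually similar words. With uniform smoothing every non-target token would receive the same share.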