Code-switching speech refers to speech in which two or more languages are mixed within a single utterance. End-to-End (E2E) Automatic Speech Recognition (ASR) for such speech is challenging due to the scarcity of training data. In this study, we investigate text generation and injection to improve the performance of a streaming model widely used in industry, the Transformer-Transducer (T-T), on Mandarin-English code-switching speech recognition. We first propose a strategy to generate code-switching text data and then investigate injecting the generated text into the T-T model, either explicitly via Text-To-Speech (TTS) conversion or implicitly by tying the speech and text latent spaces. Experiments with a T-T model trained on a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that injecting generated code-switching text significantly boosts performance, yielding a 16% relative Token-based Error Rate (TER) reduction averaged over three evaluation sets; furthermore, tying the speech and text latent spaces outperforms TTS conversion on the evaluation set whose data is most homogeneous with the training set.
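To make the text generation step concrete, below is a minimal Python sketch of one plausible strategy: producing code-switched sentences by substituting words in monolingual Mandarin text with English translations drawn from a bilingual lexicon. The lexicon entries, the pre-segmented input, and the switch probability are all illustrative assumptions, not the paper's exact method.

```python
import random

# Hypothetical Mandarin-to-English lexicon; in practice a large
# bilingual dictionary or translation alignments would be used.
LEXICON = {
    "会议": "meeting",
    "项目": "project",
    "邮件": "email",
    "下周": "next week",
}

def generate_code_switched(tokens, switch_prob=0.3, seed=None):
    """Replace each in-lexicon Mandarin token with its English
    translation with probability `switch_prob` (illustrative)."""
    rng = random.Random(seed)
    out = []
    for tok in tokens:
        if tok in LEXICON and rng.random() < switch_prob:
            out.append(LEXICON[tok])
        else:
            out.append(tok)
    return out

# Example with a pre-segmented Mandarin sentence
# ("Let's discuss this project at next week's meeting").
tokens = ["我们", "下周", "的", "会议", "讨论", "这个", "项目"]
print(" ".join(generate_code_switched(tokens, seed=0)))
# Possible output: 我们 next week 的 会议 讨论 这个 project
```

Text generated this way can then be injected into the T-T model either by synthesizing paired audio with TTS or, without audio, through a tied speech-text latent space as described above.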