Code-switching (CS) refers to the phenomenon of alternating between words and phrases from different languages. While today's neural end-to-end (E2E) models deliver state-of-the-art performance on the task of automatic speech recognition (ASR), it is well known that these systems are very data-intensive. However, only little transcribed and aligned CS speech is available. To overcome this problem and train multilingual systems that can transcribe CS speech, we propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are concatenated. Using this training data, our E2E model improves at transcribing CS speech and also outperforms the multilingual baseline. The results show that this augmentation technique can even improve the model's performance on inter-sentential language switches not seen during training by 5.03\% WER.
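The core of the proposed augmentation is straightforward: synthetic code-switched training examples are built by concatenating the waveform and the transcript of utterances drawn from different source languages. A minimal sketch, assuming raw waveforms at a shared sample rate and plain-text transcripts (the function name and interface here are illustrative, not from the paper):

```python
import numpy as np

def make_cs_example(audio_a, text_a, audio_b, text_b):
    """Build a synthetic code-switched example by concatenation.

    audio_a, audio_b: 1-D numpy arrays of samples at the same sample rate,
    one utterance per source language.
    text_a, text_b: the corresponding transcripts.
    """
    # Concatenate the audio along the time axis ...
    audio = np.concatenate([audio_a, audio_b])
    # ... and join the label sequences in the same order.
    text = text_a + " " + text_b
    return audio, text

# Illustrative use with dummy signals standing in for two utterances.
audio_de = np.zeros(16000)          # e.g. one second of German speech
audio_en = np.ones(8000)            # e.g. half a second of English speech
audio, text = make_cs_example(audio_de, "guten tag", audio_en, "good morning")
```

In practice one would sample utterance pairs randomly per epoch so the model sees many different switch points, but the augmentation itself is just this concatenation.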