Code-switching (CS) remains a challenge for Automatic Speech Recognition (ASR), especially for character-based models. Because characters from multiple languages are jointly available as outputs, character-based models suffer from phoneme duplication, which results in language-inconsistent spellings. We propose a Contextualized Connectionist Temporal Classification (CCTC) loss that encourages spelling consistency in character-based non-autoregressive ASR, which allows for faster inference. The CCTC loss conditions the main prediction on predicted contexts to ensure language-consistent spellings. In contrast to existing CTC-based approaches, the CCTC loss does not require frame-level alignments, since the context ground truth is obtained from the model's estimated path. Compared to the same model trained with the regular CTC loss, our method consistently improves ASR performance on both CS and monolingual corpora.
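The idea can be illustrated with a minimal PyTorch sketch: alongside the standard CTC loss on the main head, auxiliary context heads are trained with cross-entropy against targets derived from the model's own greedy best path (here simply the labels at the neighboring frames). All names, shapes, and the weighting factor `alpha` are assumptions for illustration; the paper's actual context-target construction (e.g., handling of blanks and repeats along the collapsed path) may differ.

```python
import torch
import torch.nn.functional as F

# Toy dimensions: T time steps, B batch size, C output characters (blank = 0)
T, B, C = 50, 2, 30
log_probs = torch.randn(T, B, C).log_softmax(-1)   # main prediction head
left_logits = torch.randn(T, B, C)                  # context head: previous label
right_logits = torch.randn(T, B, C)                 # context head: next label

targets = torch.randint(1, C, (B, 12))              # random toy transcripts
input_lengths = torch.full((B,), T)
target_lengths = torch.full((B,), 12)

# Standard CTC loss on the main head (no frame-level alignment needed)
ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)

# Context ground truth from the model's own estimated (greedy best) path:
best_path = log_probs.argmax(-1)                    # (T, B)
left_tgt = torch.roll(best_path, 1, dims=0)         # label at frame t-1
right_tgt = torch.roll(best_path, -1, dims=0)       # label at frame t+1
left_tgt[0] = 0                                     # pad sequence edges with blank
right_tgt[-1] = 0

ce_left = F.cross_entropy(left_logits.reshape(-1, C), left_tgt.reshape(-1))
ce_right = F.cross_entropy(right_logits.reshape(-1, C), right_tgt.reshape(-1))

alpha = 0.1  # assumed context-loss weight, not specified in the abstract
cctc = ctc + alpha * (ce_left + ce_right)
```

Because the context targets come from the estimated path rather than a forced alignment, the auxiliary terms can be computed in the same forward pass as the CTC loss itself.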