Continuous integrate-and-fire (CIF) based models, which use a soft and monotonic alignment mechanism, have been successfully applied to non-autoregressive (NAR) speech recognition and achieve competitive performance compared with other NAR methods. However, this alignment learning strategy can suffer from erroneous acoustic boundary estimation, which severely hinders both convergence speed and system performance. In this paper, we propose a boundary and context aware training approach for CIF based NAR models. First, connectionist temporal classification (CTC) spike information is used to guide the learning of acoustic boundaries in the CIF. Second, an additional contextual decoder is added after the CIF decoder to capture the linguistic dependencies within a sentence. Finally, we adopt the recently proposed Conformer architecture to improve the capacity of acoustic modeling. Experiments on the open-source Mandarin AISHELL-1 corpus show that the proposed method achieves a comparable character error rate (CER) of 4.9% with only 1/24 of the latency of a state-of-the-art autoregressive (AR) Conformer model. Furthermore, when evaluated on an internal 7500-hour Mandarin corpus, our model still outperforms other NAR methods and even matches the AR Conformer model on a challenging real-world noisy test set.
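To make the soft, monotonic alignment concrete, the following is a minimal sketch of the generic CIF accumulation step in plain Python. It is an illustration under stated assumptions, not the implementation used in this work: the function name `cif`, the firing threshold of 1.0, and the assumption that every per-frame weight lies in [0, 1] (so at most one firing per frame) are all choices made here. The frame indices at which the accumulator fires play the role of the acoustic boundaries discussed above; how CTC spikes guide them is not shown.

```python
def cif(hidden, alphas, threshold=1.0):
    """Generic continuous integrate-and-fire sketch (hypothetical helper).

    hidden: list of per-frame encoder vectors (lists of floats).
    alphas: list of per-frame weights, each assumed to lie in [0, 1].
    Returns (emitted_vectors, fire_frames), where fire_frames holds the
    frame indices at which a token-level vector was emitted.
    """
    dim = len(hidden[0])
    acc_weight = 0.0            # weight accumulated since the last firing
    acc_state = [0.0] * dim     # weighted sum of frames since the last firing
    emitted, fire_frames = [], []

    for t, (h, a) in enumerate(zip(hidden, alphas)):
        if acc_weight + a < threshold:
            # Integrate: keep accumulating, no boundary at this frame.
            acc_weight += a
            acc_state = [s + a * x for s, x in zip(acc_state, h)]
        else:
            # Fire: spend just enough of this frame's weight to reach the threshold.
            used = threshold - acc_weight
            emitted.append([s + used * x for s, x in zip(acc_state, h)])
            fire_frames.append(t)
            # The leftover weight starts the next accumulation.
            rest = a - used
            acc_weight = rest
            acc_state = [rest * x for x in h]

    return emitted, fire_frames


if __name__ == "__main__":
    frames = [[1.0], [2.0], [4.0]]
    weights = [0.75, 0.5, 0.75]
    print(cif(frames, weights))  # ([[1.25], [3.5]], [1, 2])
```

Because each emitted vector depends only on frames up to its firing point, the alignment is monotonic, and all outputs can be passed to a decoder in parallel. A misplaced firing frame shifts which frames contribute to which output, which is the boundary estimation error the proposed training approach targets.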