Continuous integrate-and-fire (CIF) based models, which use a soft and monotonic alignment mechanism, have been successfully applied to non-autoregressive (NAR) speech recognition, achieving performance competitive with other NAR methods. However, such an alignment learning strategy may also result in inaccurate acoustic boundary estimation and slower convergence. To eliminate these drawbacks and further improve performance, we incorporate an additional connectionist temporal classification (CTC) based alignment loss and a contextual decoder into the CIF-based NAR model. Specifically, we use the CTC spike information to guide the learning of acoustic boundaries, and adopt a new contextual decoder to capture the linguistic dependencies within a sentence in the conventional CIF model. Besides, the recently proposed Conformer architecture is also employed to model both local and global acoustic dependencies. Experiments on the open-source Mandarin corpus AISHELL-1 show that the proposed method achieves a character error rate (CER) of 4.9%, comparable to a state-of-the-art autoregressive (AR) Conformer model, with only 1/24 of its latency.
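The soft, monotonic CIF mechanism mentioned above can be illustrated with a minimal sketch: per-frame weights are accumulated left to right, and each time the running sum crosses a firing threshold, one token-level embedding is emitted as a weight-proportional sum of the frames it covers, with the surplus weight carried over to the next token. This is an illustrative simplification (the function name, the fixed threshold of 1.0, and the NumPy formulation are our assumptions, not the paper's implementation).

```python
import numpy as np

def cif_integrate(encoder_states, alphas, threshold=1.0):
    """Illustrative sketch of continuous integrate-and-fire (CIF) aggregation.

    encoder_states: (T, D) frame-level acoustic representations
    alphas: (T,) non-negative per-frame weights predicted by the model
    Returns a (U, D) array of fired token-level embeddings.
    """
    fired = []
    accum = 0.0                                  # accumulated weight so far
    state = np.zeros(encoder_states.shape[1])    # weight-proportional sum of frames
    for h, a in zip(encoder_states, alphas):
        if accum + a < threshold:
            # Not enough weight yet: keep integrating this frame.
            accum += a
            state = state + a * h
        else:
            # Boundary crossed: split the frame's weight at the threshold,
            # fire one token, and carry the leftover into the next token.
            part = threshold - accum
            fired.append(state + part * h)
            accum = a - part
            state = accum * h
    return np.stack(fired) if fired else np.zeros((0, encoder_states.shape[1]))
```

Note that the acoustic boundaries implied by the firing positions are exactly what the proposed CTC spike guidance is meant to make more accurate.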