While Transformers have achieved promising results in end-to-end (E2E) automatic speech recognition (ASR), their autoregressive (AR) structure is a bottleneck for speeding up decoding. For real-world deployment, ASR systems are expected to be highly accurate while offering fast inference. Non-autoregressive (NAR) models have become a popular alternative due to their fast inference speed, but they still fall behind AR systems in recognition accuracy. To fulfill both demands, in this paper we propose a NAR CTC/attention model that utilizes pre-trained acoustic and language models: wav2vec2.0 and BERT. To bridge the modality gap between the speech and text representations obtained from the pre-trained models, we design a novel modality conversion mechanism, which is more suitable for logographic languages. During inference, we employ the CTC branch to generate a target length, which enables BERT to predict tokens in parallel. We also design a cache-based CTC/attention joint decoding method that improves recognition accuracy while keeping decoding fast. Experimental results show that the proposed NAR model greatly outperforms our strong wav2vec2.0 CTC baseline (15.1% relative CER reduction on AISHELL-1). It also significantly surpasses previous NAR systems on the AISHELL-1 benchmark and shows potential for English tasks.
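The NAR inference flow described above (the CTC branch proposes a target length, after which BERT predicts all tokens in a single parallel pass) can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation; the module names `acoustic_encoder`, `ctc_head`, and `bert_decoder` are hypothetical placeholders.

```python
# Minimal sketch of CTC-length-guided non-autoregressive decoding
# (hypothetical module names; assumed interfaces, not the paper's actual code).
import torch


@torch.no_grad()
def nar_decode(speech, acoustic_encoder, ctc_head, bert_decoder, blank_id=0):
    """CTC branch proposes a target length; BERT then fills in tokens in parallel."""
    # 1) Encode speech with a pre-trained acoustic encoder (e.g. wav2vec2.0).
    enc = acoustic_encoder(speech)                       # (T, D) frame-level features

    # 2) CTC branch: greedy path over frames, collapse repeats, drop blanks.
    path = ctc_head(enc).argmax(dim=-1).tolist()         # greedy CTC path, length T
    collapsed = [p for i, p in enumerate(path)
                 if p != blank_id and (i == 0 or p != path[i - 1])]
    target_len = len(collapsed)                          # proposed output length

    # 3) BERT-based decoder: predict all target_len tokens in one forward pass,
    #    conditioned on the acoustic encoding (no autoregressive loop).
    logits = bert_decoder(enc, target_len)               # (target_len, vocab)
    return logits.argmax(dim=-1)                         # hypothesis token ids
```

Because step 3 is a single parallel forward pass, decoding cost no longer grows with one decoder call per output token as in AR beam search; the CTC hypothesis from step 2 can also be reused for joint rescoring, in the spirit of the cache-based CTC/attention joint decoding mentioned above.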