Recently, there has been increasing interest in unifying streaming and non-streaming speech recognition models to reduce development, training and deployment costs. The best-known approaches rely on either a window-based or a dynamic chunk-based attention strategy, combined with causal convolutions, to minimize the degradation due to streaming. However, a relatively large performance gap remains between the non-streaming mode of such unified models and a full-contextual model trained independently. To address this, we propose a dynamic chunk-based convolution that replaces the causal convolution in a hybrid Connectionist Temporal Classification (CTC)-Attention Conformer architecture. Additionally, we demonstrate further improvements by initializing weights from a full-contextual model and by running the convolution and self-attention modules in parallel. We evaluate our models on the open-source VoxPopuli and LibriSpeech datasets as well as on in-house conversational datasets. Overall, our proposed model reduces the degradation of the streaming mode relative to the non-streaming full-contextual model from 41.7% and 45.7% to 16.7% and 26.2% on the LibriSpeech test-clean and test-other sets respectively, while improving over the previous state-of-the-art unified model by 15.5% relative WER.
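To make the core idea concrete, below is a minimal PyTorch-style sketch of one way such a chunk-based depthwise convolution could be realized: every frame sees real left context that crosses chunk boundaries, plus right context only up to the end of its own chunk, with taps beyond the chunk boundary zeroed out. The function name, tensor layout and fixed `chunk_size` argument are illustrative assumptions rather than the paper's implementation; in dynamic chunk training the chunk size would be sampled randomly per batch.

```python
import torch
import torch.nn.functional as F

def dynamic_chunk_conv1d(x, weight, chunk_size):
    """Minimal sketch of a chunk-based depthwise convolution.

    Each output frame uses real left context (crossing chunk boundaries)
    and right context only up to the end of its own chunk; taps beyond
    the chunk boundary are zeroed, so no future chunk is ever touched.

    x:          (batch, channels, time)
    weight:     (channels, 1, kernel_size), kernel_size odd
    chunk_size: frames per chunk (fixed here; dynamic chunk training
                would sample it randomly per batch)
    """
    b, c, t = x.shape
    k = weight.shape[-1]
    half = (k - 1) // 2
    pad_right = (-t) % chunk_size                 # make time divisible by chunk_size
    x = F.pad(x, (half, pad_right))               # zeros as left context for frame 0
    n_chunks = (t + pad_right) // chunk_size
    outs = []
    for i in range(n_chunks):
        start = i * chunk_size
        seg = x[:, :, start:start + half + chunk_size]  # real left context + chunk
        seg = F.pad(seg, (0, half))               # zero out context past the chunk end
        outs.append(F.conv1d(seg, weight, groups=c))    # depthwise convolution
    return torch.cat(outs, dim=-1)[:, :, :t]     # drop the right padding

# Illustrative usage: output shape matches the input, like a same-padded conv.
x = torch.randn(2, 256, 100)                      # (batch, channels, time)
w = torch.randn(256, 1, 15)                       # depthwise kernel of size 15
y = dynamic_chunk_conv1d(x, w, chunk_size=16)
assert y.shape == x.shape
```

Under this layout, the last frame of each chunk degenerates to a causal view (all of its right taps are zeros), while earlier frames recover progressively more future context; this is what distinguishes the approach from a purely causal convolution and lets the same weights serve both streaming and non-streaming inference.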