Autoregressive (AR) models, such as attention-based encoder-decoder models and the RNN-Transducer, have achieved great success in speech recognition. They predict each output token conditioned on the previously generated tokens and the encoded acoustic states, which makes decoding hard to parallelize and therefore inefficient on GPUs. Non-autoregressive (NAR) models remove the temporal dependency between output tokens and predict the entire output sequence in a single step or a small number of steps. However, NAR models still face two major problems. On the one hand, there is still a large performance gap between NAR models and advanced AR models. On the other hand, most NAR models are difficult to train and slow to converge. To address these two problems, we propose a new model named the Two-Step Non-Autoregressive Transformer (TSNAT), which improves the performance and accelerates the convergence of the NAR model by learning prior knowledge from a parameter-sharing AR model. Furthermore, we introduce a two-stage method into the inference process, which greatly improves the model's performance. All the experiments are conducted on a public Mandarin Chinese dataset, AISHELL-1. The results show that TSNAT achieves performance competitive with the AR model and outperforms many complicated NAR models.
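To make the AR/NAR distinction concrete, the following is a minimal sketch (not the paper's implementation) contrasting token-by-token AR decoding with single-pass NAR prediction. The helper functions `predict_next_token` and `predict_all_tokens`, the vocabulary size, and the dummy encoder output are illustrative assumptions, not the TSNAT architecture.

```python
# Minimal sketch: AR decoding is a sequential loop, NAR decoding is one parallel pass.
# All model components here are stand-ins (assumptions), not TSNAT itself.
import numpy as np

VOCAB, EOS, MAX_LEN = 10, 0, 8
rng = np.random.default_rng(0)

def predict_next_token(enc_states, prefix):
    """Stand-in for one AR decoder step: one token, conditioned on the prefix."""
    logits = rng.normal(size=VOCAB) + 0.01 * len(prefix)
    return int(np.argmax(logits))

def predict_all_tokens(enc_states, length):
    """Stand-in for a NAR decoder: every position predicted in a single pass."""
    logits = rng.normal(size=(length, VOCAB))
    return np.argmax(logits, axis=-1).tolist()

enc_states = rng.normal(size=(50, 256))  # dummy acoustic encoder output

# Autoregressive: one decoder call per output token; steps cannot run in parallel.
ar_out = []
for _ in range(MAX_LEN):
    tok = predict_next_token(enc_states, ar_out)
    if tok == EOS:
        break
    ar_out.append(tok)

# Non-autoregressive: a single call predicts all positions at once,
# so the work parallelizes across output positions on a GPU.
nar_out = predict_all_tokens(enc_states, MAX_LEN)

print("AR tokens :", ar_out)
print("NAR tokens:", nar_out)
```

The sketch only illustrates why NAR inference parallelizes well; it does not model the accuracy gap or the convergence issues that TSNAT addresses.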