A CTC Alignment-based Non-autoregressive Transformer for End-to-end Automatic Speech Recognition

Recently, end-to-end models have been widely used in automatic speech recognition (ASR) systems. Two of the most representative approaches are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. Autoregressive transformers, a variant of AED, generate tokens autoregressively and are therefore relatively slow during inference. In this paper, we present a comprehensive study of a CTC Alignment-based Single-Step Non-Autoregressive Transformer (CASS-NAT) for end-to-end ASR. In CASS-NAT, the word embeddings of the autoregressive transformer (AT) are replaced with token-level acoustic embeddings (TAEs), which are extracted from the encoder outputs using the acoustic boundary information provided by the CTC alignment. TAEs can be obtained in parallel, which allows the output tokens to be generated in parallel. During training, Viterbi alignment is used for TAE generation, and multiple training strategies are explored to improve word error rate (WER). During inference, an error-based alignment sampling method is investigated in depth to reduce the alignment mismatch between training and testing. Experimental results show that CASS-NAT achieves a WER close to that of AT on various ASR tasks while providing a ~24x inference speedup. With and without self-supervised learning, we achieve new state-of-the-art results for non-autoregressive models on several datasets. We also analyze the behavior of the CASS-NAT decoder to explain why it can perform similarly to AT. We find that TAEs function similarly to word embeddings with respect to grammatical structures, which might indicate that some semantic information can be learned from TAEs without a language model.
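To make the idea of token-level acoustic embeddings more concrete, the following is a minimal sketch of how a frame-level CTC alignment can mark token boundaries over the encoder outputs. It is not the CASS-NAT implementation: the abstract does not specify the extraction mechanism, so mean pooling over each token's frames is used here purely as a stand-in. The names `token_segments` and `acoustic_embeddings`, the blank index `0`, and the treatment of blank frames (simply skipped) are all assumptions made for illustration.

```python
from typing import List, Tuple

BLANK = 0  # assumed index of the CTC blank symbol


def token_segments(alignment: List[int], blank: int = BLANK) -> List[Tuple[int, int, int]]:
    """Collapse a frame-level CTC alignment into (token, start, end) spans.

    Consecutive repeats of a non-blank label form one token, and a blank
    frame between two identical labels separates them. `end` is exclusive.
    """
    segments: List[List[int]] = []
    prev = blank
    for t, label in enumerate(alignment):
        if label != blank and label != prev:
            segments.append([label, t, t + 1])  # a new token starts at frame t
        elif label != blank:
            segments[-1][2] = t + 1  # the current token continues
        prev = label
    return [(tok, start, end) for tok, start, end in segments]


def acoustic_embeddings(encoder_out: List[List[float]], alignment: List[int]) -> List[List[float]]:
    """Mean-pool encoder frames inside each token span (illustrative stand-in).

    Each token's embedding depends only on its own span, so all tokens
    could be computed in parallel rather than one after another.
    """
    embeddings = []
    for _, start, end in token_segments(alignment):
        frames = encoder_out[start:end]
        dim = len(frames[0])
        embeddings.append([sum(f[d] for f in frames) / len(frames) for d in range(dim)])
    return embeddings


# Toy example: 8 encoder frames of dimension 2, three tokens in the alignment.
alignment = [0, 3, 3, 0, 3, 5, 5, 0]
encoder_out = [[float(t), 1.0] for t in range(8)]
segments = token_segments(alignment)
taes = acoustic_embeddings(encoder_out, alignment)
```

The number of embeddings equals the number of tokens implied by the alignment, which is why the decoder can emit all output tokens in a single step once the alignment is fixed.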
