We propose a CTC alignment-based single-step non-autoregressive transformer (CASS-NAT) for speech recognition. Specifically, the CTC alignment contains information about (a) the number of tokens for the decoder input, and (b) the time span of acoustics for each token. This information is used to extract an acoustic representation for each token in parallel, referred to as a token-level acoustic embedding, which substitutes for the word embedding in the autoregressive transformer (AT) to achieve parallel generation in the decoder. During inference, an error-based alignment sampling method is proposed and applied to the CTC output space, reducing the WER while retaining parallelism. Experimental results show that the proposed method achieves WERs of 3.8%/9.1% on the Librispeech test-clean/test-other sets without an external LM, and a CER of 5.8% on the Aishell1 Mandarin corpus. Compared to the AT baseline, CASS-NAT shows a degradation in WER but is 51.2x faster in terms of RTF. When decoding with an oracle CTC alignment, the lower bound of WER without an LM reaches 2.3% on the test-clean set, indicating the potential of the proposed method.
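The token-level acoustic embedding described above can be illustrated with a minimal sketch. It assumes a frame-level CTC alignment (a per-frame array of token ids including blanks and repeats) and uses simple mean-pooling over each token's time span; the names (`token_spans`, `token_acoustic_embeddings`, `BLANK`) and the pooling choice are illustrative assumptions, not the paper's exact extraction module.

```python
import numpy as np

BLANK = 0  # assumed id of the CTC blank token

def token_spans(alignment):
    """Collapse a frame-level CTC alignment into (token, start, end) spans.

    Consecutive identical non-blank ids form one token; blanks separate
    tokens. This yields both the number of tokens (decoder input length)
    and each token's acoustic time span.
    """
    spans = []
    prev, start = None, None
    for t, tok in enumerate(alignment):
        if tok != prev:
            if prev is not None and prev != BLANK:
                spans.append((prev, start, t))
            prev, start = tok, t
    if prev is not None and prev != BLANK:
        spans.append((prev, start, len(alignment)))
    return spans

def token_acoustic_embeddings(encoder_out, alignment):
    """Mean-pool encoder frames over each token's span (illustrative pooling)."""
    return np.stack([encoder_out[s:e].mean(axis=0)
                     for _, s, e in token_spans(alignment)])
```

Because every token's span is known up front, all embeddings can be computed in one parallel pass, which is what lets the decoder generate all tokens in a single step instead of autoregressively.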