We propose a streaming non-autoregressive (non-AR) decoding algorithm to deliberate the hypothesis alignment of a streaming RNN-T model. Our algorithm facilitates a simple greedy decoding procedure, and at the same time is capable of producing the decoding result at each frame with limited right context, thus enjoying both high efficiency and low latency. These advantages are achieved by converting the offline Align-Refine algorithm to be streaming-compatible, with a novel transformer decoder architecture that performs local self-attention over both text and audio, and a time-aligned cross-attention at each layer. Furthermore, we perform discriminative training of our model with the minimum word error rate (MWER) criterion, which has not been done in the non-AR decoding literature. Experiments on voice search datasets and Librispeech show that with reasonable right context, our streaming model performs as well as the offline counterpart, and discriminative training leads to further WER gains when the first-pass model has small capacity.
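The limited-right-context constraint described above can be illustrated with a banded attention mask: each frame attends only to a bounded window of past frames and a small number of future frames. The sketch below is purely illustrative (the function name, window parameters, and NumPy formulation are our own, not from the paper); it shows how such a mask could be constructed.

```python
import numpy as np

def streaming_attention_mask(num_frames: int, left_context: int, right_context: int) -> np.ndarray:
    """Boolean mask where mask[i, j] is True iff frame i may attend to frame j.

    Illustrative only: frame i sees up to `left_context` frames in the past
    and `right_context` frames in the future, mimicking a streaming model
    with limited lookahead.
    """
    idx = np.arange(num_frames)
    rel = idx[None, :] - idx[:, None]  # rel[i, j] = j - i
    return (rel >= -left_context) & (rel <= right_context)

# Example: 5 frames, 2 frames of left context, 1 frame of right context.
mask = streaming_attention_mask(5, left_context=2, right_context=1)
```

With `right_context=1`, frame 2 can attend to frames 1 through 3 (and frame 0 via its left context) but not to frame 4, which is what keeps the per-frame decoding latency bounded.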