In the present paper, an attempt is made to combine Mask-CTC and the triggered attention mechanism to construct a streaming end-to-end automatic speech recognition (ASR) system that provides high performance with low latency. The triggered attention mechanism, which performs autoregressive decoding triggered by CTC spikes, has been shown to be effective in streaming ASR. However, to maintain high accuracy of the alignment estimation based on CTC outputs, which is the key to its performance, decoding inevitably has to wait for some future input frames (i.e., it incurs higher latency). In streaming ASR, however, it is desirable to achieve high recognition accuracy while keeping latency low. The present study therefore aims to achieve highly accurate streaming ASR with low latency by introducing Mask-CTC, which learns feature representations that anticipate future information (i.e., that take long-term context into account), into encoder pre-training. Experimental comparisons conducted on the WSJ data demonstrate that the proposed method achieves higher accuracy with lower latency than the conventional triggered attention-based streaming ASR system.
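As a rough illustration of the triggered attention idea described above, the sketch below derives trigger frames from greedy CTC output: each CTC spike (a new non-blank label) marks the encoder frame at which the decoder may emit the next token, and an optional lookahead of extra future frames corresponds to the latency trade-off discussed in the abstract. The function name, the `lookahead` parameter, and the greedy spike detection are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def ctc_trigger_frames(ctc_posteriors, blank_id=0, lookahead=0):
    """Pick one trigger frame per output token from greedy CTC output.

    ctc_posteriors: (T, V) array of per-frame CTC label posteriors.
    Returns a list of frame indices at which decoding of the next token
    may be triggered, each delayed by `lookahead` extra future frames.
    (Hypothetical sketch; not the paper's implementation.)
    """
    best = ctc_posteriors.argmax(axis=1)  # greedy per-frame label sequence
    triggers = []
    prev = blank_id
    for t, label in enumerate(best):
        # A non-blank label that differs from the previous frame marks a CTC spike.
        if label != blank_id and label != prev:
            triggers.append(min(t + lookahead, len(best) - 1))
        prev = label
    return triggers
```

With `lookahead=0` the decoder fires as soon as a spike appears (lowest latency); larger values let the alignment estimate use more future context at the cost of delay, which is the trade-off the proposed Mask-CTC pre-training is intended to relax.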