Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results on numerous tasks. However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR, where each word must be recognized shortly after it is spoken. In this work, we present the dual causal/non-causal self-attention (DCN) architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer when used in a deep architecture. DCN is compared to chunk-based and restricted self-attention using streaming transformer and conformer architectures, showing improved ASR performance over restricted self-attention and competitive results compared to chunk-based self-attention, while providing the advantage of frame-synchronous processing. Combined with triggered attention, the proposed streaming end-to-end ASR systems obtain state-of-the-art results on the LibriSpeech, HKUST, and Switchboard ASR tasks.
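The context-growth problem that motivates DCN can be illustrated with attention masks. The sketch below (an illustration, not the paper's implementation; all function names are hypothetical) models each self-attention layer as a boolean reachability matrix: stacking restricted self-attention layers with look-ahead `w` compounds the look-ahead to `L * w`, whereas keeping a strictly causal stream through the stack and applying the non-causal look-ahead only in the output branch, as DCN does conceptually, keeps the total look-ahead at `w`.

```python
def restricted_mask(T, lookahead):
    # mask[t][s] is True if frame t may attend to frame s
    return [[s <= t + lookahead for s in range(T)] for t in range(T)]

def compose(a, b):
    # reachability of information flow through two stacked attention layers
    T = len(a)
    return [[any(a[t][k] and b[k][s] for k in range(T)) for s in range(T)]
            for t in range(T)]

def stack(mask, num_layers):
    reach = mask
    for _ in range(num_layers - 1):
        reach = compose(reach, mask)
    return reach

def max_lookahead(reach):
    # furthest future frame visible from frame 0
    return max(s for s, visible in enumerate(reach[0]) if visible)

T, w, L = 10, 1, 3
# Restricted self-attention: the look-ahead compounds across layers (L * w).
print(max_lookahead(stack(restricted_mask(T, w), L)))  # 3

# DCN (sketch): a strictly causal stream propagates through all layers;
# the non-causal branch adds one look-ahead of w that is never fed deeper.
causal = stack(restricted_mask(T, 0), L)
print(max_lookahead(compose(causal, restricted_mask(T, w))))  # 1
```

With `L = 3` layers and per-layer look-ahead `w = 1`, the restricted variant effectively looks 3 frames ahead, while the causal/non-causal split stays at 1 frame, which is the property the abstract attributes to DCN.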