The Transformer architecture has been widely adopted as the dominant architecture for most sequence transduction tasks, including automatic speech recognition (ASR), since its attention mechanism excels at capturing long-range dependencies. While models built solely upon attention can be parallelized better than regular RNNs, SRU++, a recently proposed architecture that combines fast recurrence with attention, exhibits strong sequence modeling capability and achieves near-state-of-the-art results on various language modeling and machine translation tasks with improved compute efficiency. In this work, we present the advantages of applying SRU++ to ASR tasks by comparing it with Conformer across multiple ASR benchmarks, and we study how these benefits generalize to long-form speech inputs. On the popular LibriSpeech benchmark, our SRU++ model achieves 2.0% / 4.7% WER on test-clean / test-other, a competitive result compared with the state-of-the-art Conformer encoder under the same setup. Notably, our analysis shows that SRU++ surpasses Conformer on long-form speech input by a large margin.
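To make the "fast recurrence plus attention" combination concrete, the following is a minimal sketch of an SRU++-style layer, simplified from the published SRU++ formulation and not the implementation used in this work: self-attention over a down-projected input replaces the linear input projection of SRU, and the per-step update is an elementwise recurrence with no matrix multiplication inside the time loop. All class and parameter names here are illustrative assumptions.

```python
# Minimal sketch of an SRU++-style layer (simplified; not the paper's code).
import torch
import torch.nn as nn


class SRUppLayerSketch(nn.Module):
    """Attention computes the input projection U; the elementwise SRU
    recurrence is written as a plain Python loop for clarity."""

    def __init__(self, d_model: int, d_attn: int, num_heads: int = 4):
        super().__init__()
        self.down = nn.Linear(d_model, d_attn, bias=False)        # down-projection (query)
        self.attn = nn.MultiheadAttention(d_attn, num_heads, batch_first=True)
        self.up = nn.Linear(d_attn, 3 * d_model, bias=False)      # up-projection -> U for gates
        self.v_f = nn.Parameter(torch.zeros(d_model))              # peephole weights
        self.v_r = nn.Parameter(torch.zeros(d_model))
        self.b_f = nn.Parameter(torch.zeros(d_model))
        self.b_r = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):                       # x: (batch, time, d_model)
        q = self.down(x)
        a, _ = self.attn(q, q, q)               # self-attention over the projected input
        u = self.up(q + a)                      # residual, then up-project
        u0, u_f, u_r = u.chunk(3, dim=-1)
        outputs, c = [], torch.zeros_like(x[:, 0])
        for t in range(x.size(1)):              # fast elementwise recurrence (no matmul)
            f = torch.sigmoid(u_f[:, t] + self.v_f * c + self.b_f)   # forget gate
            c = f * c + (1.0 - f) * u0[:, t]                          # internal state
            r = torch.sigmoid(u_r[:, t] + self.v_r * c + self.b_r)   # reset gate
            outputs.append(r * c + (1.0 - r) * x[:, t])               # highway output
        return torch.stack(outputs, dim=1)
```

Because all matrix multiplications (down-projection, attention, up-projection) are applied over the whole sequence before the loop, the remaining per-step work is cheap and elementwise, which is the source of SRU++'s compute efficiency relative to a conventional RNN cell.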