Recently, attention-based encoder-decoder (AED) models have shown high performance for end-to-end automatic speech recognition (ASR) across several tasks. To address overconfidence in such models, in this paper we introduce the concept of relaxed attention, a simple gradual injection of a uniform distribution into the encoder-decoder attention weights during training that is easily implemented with two lines of code. We investigate the effect of relaxed attention across different AED model architectures and two prominent ASR tasks, Wall Street Journal (WSJ) and Librispeech. We find that transformers trained with relaxed attention consistently outperform the standard baseline models during decoding with external language models. On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming the state of the art (4.20%) by 13.1% relative, while introducing only a single hyperparameter.
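To make the described smoothing concrete, below is a minimal PyTorch sketch of blending softmax-normalized encoder-decoder attention weights with a uniform distribution during training; the function name, signature, and the coefficient symbol gamma are illustrative assumptions and not the authors' released code.

```python
import torch

def relax_attention(attn_weights: torch.Tensor, gamma: float) -> torch.Tensor:
    """Blend cross-attention weights with a uniform distribution (illustrative sketch).

    attn_weights: softmax-normalized encoder-decoder attention, shape (..., src_len)
    gamma: relaxation coefficient in [0, 1]; gamma = 0 recovers standard attention.
    """
    src_len = attn_weights.size(-1)
    # The "two lines": build the uniform distribution over encoder frames and
    # mix it gradually into the learned attention weights.
    uniform = torch.full_like(attn_weights, 1.0 / src_len)
    return (1.0 - gamma) * attn_weights + gamma * uniform
```

In this sketch the relaxation would be applied only to the encoder-decoder (cross) attention during training, leaving inference unchanged.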