The two most common paradigms for end-to-end speech recognition are connectionist temporal classification (CTC) and attention-based encoder-decoder (AED) models. It has been argued that the latter is better suited for learning an implicit language model. We test this hypothesis by measuring temporal context sensitivity and by evaluating how the models perform when we constrain the amount of contextual information in the audio input. We find that the AED model is indeed more context-sensitive, but that the gap can be closed by adding self-attention to the CTC model. Furthermore, the two models perform similarly when contextual information is constrained. Finally, in contrast to previous research, our results show that the CTC model is highly competitive on WSJ and LibriSpeech without the help of an external language model.
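To make the self-attention augmentation concrete, the sketch below is a minimal, assumption-laden illustration rather than the paper's exact architecture: it places a single global self-attention layer (with a residual connection) on top of a bidirectional LSTM encoder and trains the result with the standard CTC loss. All hyperparameters here, 80 log-mel features, hidden size 256, 4 attention heads, and a 32-symbol vocabulary with blank index 0, are hypothetical choices for illustration.

```python
# Minimal sketch (assumed architecture, not the paper's): a CTC acoustic
# model whose recurrent encoder is widened with one self-attention layer.
import torch
import torch.nn as nn

class SelfAttentiveCTC(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_heads=4, vocab_size=32):
        super().__init__()
        # Bidirectional LSTM encoder over log-mel input features.
        self.encoder = nn.LSTM(n_mels, hidden, num_layers=3,
                               batch_first=True, bidirectional=True)
        # Self-attention lets every frame attend to the whole utterance,
        # extending the temporal context available to the output layer.
        self.attn = nn.MultiheadAttention(2 * hidden, n_heads,
                                          batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)  # index 0 = CTC blank

    def forward(self, feats):               # feats: (batch, time, n_mels)
        h, _ = self.encoder(feats)          # (batch, time, 2 * hidden)
        a, _ = self.attn(h, h, h)           # global self-attention
        h = h + a                           # residual connection
        return self.out(h).log_softmax(-1)  # per-frame log-probs for CTC

model = SelfAttentiveCTC()
log_probs = model(torch.randn(2, 100, 80))        # two dummy utterances
loss = nn.CTCLoss(blank=0)(
    log_probs.transpose(0, 1),                    # CTCLoss expects (T, B, V)
    torch.randint(1, 32, (2, 20)),                # dummy label sequences
    torch.full((2,), 100), torch.full((2,), 20))  # input / target lengths
```

A single post-encoder attention layer suffices for this illustration because it gives each frame access to the entire utterance, which is exactly the kind of widened temporal context the abstract's context-sensitivity comparison turns on.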