In this paper, we describe the use of recurrent neural networks to capture sequential information from self-attention representations in order to improve Transformers. Although the self-attention mechanism provides a means to exploit long context, sequential information, i.e., the arrangement of tokens, is not explicitly captured. We propose to cascade recurrent neural networks onto the Transformer, referred to as the TransfoRNN model, to capture this sequential information. We found that TransfoRNN models consisting of only a shallow Transformer stack suffice to give performance comparable to, if not better than, a deeper Transformer model. Evaluated on the Penn Treebank and WikiText-2 corpora, the proposed TransfoRNN model shows lower perplexities with fewer model parameters. On the Penn Treebank corpus, perplexity was reduced by up to 5.5% with the model size reduced by up to 10.5%. On the WikiText-2 corpus, perplexity was reduced by up to 2.2% with a 27.7% smaller model. The TransfoRNN model was also applied to the LibriSpeech speech recognition task and showed results comparable to those of Transformer models.
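To make the cascade concrete, the following is a minimal sketch (not the authors' implementation) of the architecture the abstract describes: a shallow Transformer encoder stack whose self-attention representations are fed into a recurrent layer that models token order explicitly before the output projection. The layer sizes, the choice of an LSTM as the recurrent component, and the class and parameter names are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TransfoRNNSketch(nn.Module):
    """Illustrative sketch: shallow Transformer stack cascaded with an RNN."""

    def __init__(self, vocab_size, d_model=512, n_heads=8,
                 n_transformer_layers=2, rnn_hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        # Shallow Transformer stack: exploits long context via self-attention.
        self.transformer = nn.TransformerEncoder(
            encoder_layer, num_layers=n_transformer_layers)
        # Recurrent layer cascaded on top: captures sequential (token-order)
        # information from the self-attention representations.
        self.rnn = nn.LSTM(d_model, rnn_hidden, batch_first=True)
        self.proj = nn.Linear(rnn_hidden, vocab_size)

    def forward(self, tokens, attn_mask=None):
        x = self.embed(tokens)                   # (batch, seq, d_model)
        x = self.transformer(x, mask=attn_mask)  # self-attention representations
        x, _ = self.rnn(x)                       # sequential modelling of those representations
        return self.proj(x)                      # next-token logits
```

For language modelling, `attn_mask` would be a causal mask so that each position attends only to earlier tokens; keeping the Transformer stack shallow while adding the recurrent layer is what the abstract reports as reducing both perplexity and model size.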