The Transformer architecture consists of self-attention and feed-forward networks (FFNs), which previous works have viewed as key-value memories. However, the FFN and the traditional key-value memory use different activation functions (ReLU and Softmax, respectively), which makes them not strictly equivalent. In this paper, we first rebuild the connection between the FFN and key-value memory by conducting extensive studies on ReLU and Softmax, and find that they become equivalent when an additional layer normalization module is added on top of Softmax. Moreover, ReLU outperforms Softmax in both the FFN and the key-value memory when the number of value slots is large. We analyze the reasons for this and then exploit this property of ReLU in the self-attention network, where the original Softmax activation performs poorly on long input sequences. We further propose a fully ReLU-based architecture, ReLUFormer, which outperforms the baseline Transformer on long-sequence tasks such as document translation. This paper sheds light on two points: 1) Softmax and ReLU apply different normalization over elements, which leads to different output variances, and ReLU is better suited to a large number of key-value slots; 2) the FFN and key-value memory are equivalent, so the Transformer can be viewed as a memory network in which both FFNs and self-attention networks are key-value memories.
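To make the FFN-as-key-value-memory view concrete, the following is a minimal sketch (not the paper's released code) of a single layer in which the first projection produces slot scores against "keys" and the second projection mixes the corresponding "values". It contrasts the standard ReLU FFN, a Softmax key-value memory, and a Softmax variant with an added layer normalization; the exact placement of the LayerNorm, the module names, and the dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class KeyValueMemoryFFN(nn.Module):
    """FFN viewed as key-value memory: rows of the first projection act as keys,
    rows of the second projection act as values."""

    def __init__(self, d_model: int, num_slots: int, activation: str = "relu"):
        super().__init__()
        self.keys = nn.Linear(d_model, num_slots, bias=False)    # query-key scores
        self.values = nn.Linear(num_slots, d_model, bias=False)  # weighted sum of values
        self.activation = activation
        # LayerNorm over slot weights; placement is an assumption for illustration.
        self.score_norm = nn.LayerNorm(num_slots)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        scores = self.keys(x)                       # (..., num_slots)
        if self.activation == "relu":               # standard Transformer FFN
            weights = F.relu(scores)
        elif self.activation == "softmax":          # traditional key-value memory
            weights = F.softmax(scores, dim=-1)
        elif self.activation == "softmax_ln":       # Softmax + layer normalization
            weights = self.score_norm(F.softmax(scores, dim=-1))
        else:
            raise ValueError(f"unknown activation: {self.activation}")
        return self.values(weights)


if __name__ == "__main__":
    x = torch.randn(4, 16, 512)  # (batch, seq_len, d_model)
    for act in ("relu", "softmax", "softmax_ln"):
        layer = KeyValueMemoryFFN(d_model=512, num_slots=2048, activation=act)
        out = layer(x)
        # Softmax normalizes over all slots, so output variance shrinks as the slot
        # count grows; ReLU does not normalize, which relates to the paper's point
        # about handling a large number of key-value slots.
        print(act, tuple(out.shape), out.var().item())
```

Running the toy comparison shows the variance gap between the Softmax and ReLU weightings, which the added layer normalization is meant to close.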