We show the formal equivalence of linearised self-attention mechanisms and fast weight controllers from the early '90s, where a ``slow'' neural net learns by gradient descent to program the ``fast weights'' of another net through sequences of elementary programming instructions which are additive outer products of self-invented activation patterns (today called keys and values). Such Fast Weight Programmers (FWPs) learn to manipulate the contents of a finite memory and dynamically interact with it. We infer a memory capacity limitation of recent linearised softmax attention variants, and replace the purely additive outer products by a delta rule-like programming instruction, such that the FWP can more easily learn to correct the current mapping from keys to values. The FWP also learns to compute dynamically changing learning rates. We also propose a new kernel function to linearise attention which balances simplicity and effectiveness. We conduct experiments on synthetic retrieval problems as well as standard machine translation and language modelling tasks which demonstrate the benefits of our methods.
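For concreteness, the two programming instructions referred to above can be sketched as follows (notation assumed rather than defined in this abstract: $\phi$ is the kernel feature map applied to keys and queries, $k_t$, $v_t$, $q_t$ are the key, value and query generated at step $t$, $W_t$ is the fast weight matrix, $y_t$ the output, and $\beta_t$ a dynamically generated learning rate). The purely additive instruction of linearised attention and its delta rule-like replacement read
\[
W_t = W_{t-1} + v_t\,\phi(k_t)^\top, \qquad y_t = W_t\,\phi(q_t),
\]
\[
W_t = W_{t-1} + \beta_t\,\bigl(v_t - W_{t-1}\,\phi(k_t)\bigr)\,\phi(k_t)^\top,
\]
where the second update lets the FWP overwrite the value currently associated with $\phi(k_t)$ instead of only accumulating onto it.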