Dynamic evaluation of language models (LMs) adapts model parameters at test time using gradient information from previous tokens and substantially improves LM performance. However, it requires over 3x more compute than standard inference. We present Fast Weight Layers (FWLs), a neural component that provides the benefits of dynamic evaluation much more efficiently by expressing gradient updates as linear attention. A key improvement over dynamic evaluation is that FWLs can also be applied at training time so the model learns to make good use of gradient updates. FWLs can easily be added on top of existing transformer models, require relatively little extra compute or memory to run, and significantly improve language modeling perplexity.
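A minimal sketch of the equivalence the abstract alludes to, not the paper's implementation: a single gradient step on a squared "fast weight" regression loss over past tokens, taken at zero initialization, yields an outer-product weight matrix whose readout equals unnormalized linear attention over the past key/value pairs. The variable names, dimensions, and learning rate below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 8  # feature dimension, number of previous tokens (illustrative)
keys = rng.normal(size=(T, d))
values = rng.normal(size=(T, d))
query = rng.normal(size=(d,))
lr = 0.5  # illustrative learning rate

# Gradient view: one step on L(W) = 0.5 * sum_t ||v_t - W k_t||^2
# evaluated at W = 0 gives the update W = lr * sum_t v_t k_t^T,
# i.e. a sum of outer products ("fast weights").
W = lr * sum(np.outer(v, k) for k, v in zip(keys, values))
gradient_view = W @ query

# Linear-attention view: the same readout is
# lr * sum_t (k_t . q) v_t, an unnormalized linear attention over
# the previous tokens' key/value pairs.
attention_view = lr * sum((k @ query) * v for k, v in zip(keys, values))

assert np.allclose(gradient_view, attention_view)
```

The equivalence holds exactly only for this single step at zero initialization; it illustrates why a gradient-based fast-weight update can be computed with attention-style operations rather than explicit backpropagation at test time.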