Linear layers in neural networks (NNs) trained by gradient descent can be expressed as a key-value memory system which stores all training datapoints and the initial weights, and produces outputs using unnormalised dot attention over the entire training experience. While this has been technically known since the 1960s, no prior work has effectively studied the operations of NNs in such a form, presumably due to prohibitive time and space complexities and impractical model sizes, all of which grow linearly with the number of training patterns, which may get very large. However, this dual formulation offers the possibility of directly visualising how an NN makes use of training patterns at test time, by examining the corresponding attention weights. We conduct experiments on small-scale supervised image classification tasks in single-task, multi-task, and continual learning settings, as well as language modelling, and discuss the potential and limits of this view for better understanding and interpreting how NNs exploit training patterns. Our code is public.
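To make the dual formulation concrete, the following is a minimal NumPy sketch (with illustrative variable names, not the paper's code): a linear layer trained by plain SGD produces the same test-time output whether evaluated with its final weights (primal form) or via unnormalised dot attention over the stored training inputs (keys) and learning-rate-scaled error signals (values), added to the initial weights' output (dual form).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_steps, lr = 5, 3, 200, 0.05

W0 = rng.normal(size=(d_out, d_in)) * 0.1   # initial weights
W = W0.copy()

keys, values = [], []                        # stored training patterns (dual form)

for _ in range(n_steps):
    x = rng.normal(size=d_in)                # training input
    target = rng.normal(size=d_out)          # arbitrary regression target
    err = W @ x - target                     # gradient of 0.5*||Wx - target||^2 w.r.t. the output
    W -= lr * np.outer(err, x)               # primal SGD update: W <- W - lr * err x^T
    keys.append(x)                           # key: the training input
    values.append(-lr * err)                 # value: negative lr-scaled error signal

# Test-time prediction, primal vs. dual form.
x_test = rng.normal(size=d_in)
y_primal = W @ x_test

attn = np.array([k @ x_test for k in keys])          # unnormalised dot attention weights
y_dual = W0 @ x_test + np.array(values).T @ attn     # W0 x + sum_t value_t * (key_t . x_test)

print(np.allclose(y_primal, y_dual))                 # True: the two forms agree
```

As the sketch suggests, the attention weights `attn` assign one scalar per training step, which is what allows the contribution of individual training patterns to a test prediction to be visualised; the cost is that keys and values must be kept for every update, so memory and compute grow linearly with the number of training patterns.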