Transformers have become the state-of-the-art neural network architecture across numerous domains of machine learning. This is partly due to their celebrated ability to transfer and to learn in-context based on few examples. Nevertheless, the mechanisms by which Transformers become in-context learners are not well understood and remain mostly an intuition. Here, we argue that training Transformers on auto-regressive tasks can be closely related to well-known gradient-based meta-learning formulations. We start by providing a simple weight construction that shows the equivalence of data transformations induced by 1) a single linear self-attention layer and by 2) gradient descent (GD) on a regression loss. Motivated by that construction, we show empirically that when training self-attention-only Transformers on simple regression tasks, either the models learned by GD and by the Transformer show great similarity, or, remarkably, the weights found by optimization match the construction. Thus we show how trained Transformers implement gradient descent in their forward pass. This allows us, at least in the domain of regression problems, to mechanistically understand the inner workings of optimized Transformers that learn in-context. Furthermore, we identify how Transformers surpass plain gradient descent by learning an iterative curvature correction, and how they solve non-linear regression tasks by learning linear models on deep data representations. Finally, we discuss intriguing parallels to a mechanism identified as crucial for in-context learning, termed the induction head (Olsson et al., 2022), and show how it could be understood as a specific case of in-context learning by gradient descent within Transformers.
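To make the weight construction concrete, the following is a minimal numerical sketch (not the paper's code) that checks the claimed equivalence: a single linear self-attention layer, with hand-set key/query, value, and projection matrices, produces the same prediction on a query token as one step of gradient descent on a linear regression loss. The initialization W0 = 0, the token layout (x_i, y_i), and all variable names are illustrative assumptions consistent with the construction sketched above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, eta = 4, 16, 0.1

# In-context linear regression task: N context pairs (x_i, y_i) and one query x_q.
X = rng.normal(size=(N, d))              # context inputs
w_true = rng.normal(size=d)
y = X @ w_true                           # scalar context targets
x_q = rng.normal(size=d)                 # query input

# 1) One gradient-descent step on L(W) = 1/(2N) * sum_i (W x_i - y_i)^2,
#    starting from W0 = 0 (an illustrative choice for this sketch).
W0 = np.zeros(d)
grad = (X @ W0 - y) @ X / N              # dL/dW
W1 = W0 - eta * grad
pred_gd = W1 @ x_q                       # prediction after one GD step

# 2) One linear self-attention layer (no softmax) with hand-constructed weights.
#    Tokens are e_i = (x_i, y_i); the query token is (x_q, 0).
E = np.vstack([np.hstack([X, y[:, None]]),
               np.append(x_q, 0.0)])
W_KQ = np.zeros((d + 1, d + 1)); W_KQ[:d, :d] = np.eye(d)           # keys/queries read the x-part
W_V = np.zeros((d + 1, d + 1)); W_V[d, :d] = W0; W_V[d, d] = -1.0   # values read W0 x_i - y_i
P = -(eta / N) * np.eye(d + 1)           # output projection carries step size and 1/N averaging

K = E @ W_KQ.T; Q = E @ W_KQ.T; V = E @ W_V.T
attn = Q @ K[:N].T                       # raw dot products x_j . x_i (context tokens as keys/values)
E_out = E + (attn @ V[:N]) @ P.T         # token update: e_j += P * sum_i (q_j . k_i) v_i

pred_attn = E_out[-1, d]                 # y-channel of the updated query token
print(np.allclose(pred_gd, pred_attn))  # True: the two data transformations coincide
```

The key point the sketch illustrates is that the unnormalized attention scores x_j . x_i are exactly the inner products appearing in the GD weight update, so the layer's output on the query token's y-channel equals the one-step GD prediction; the step size eta enters only through the projection P, which a trained layer could absorb into its learned weights.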