Neural sequence models, especially transformers, exhibit a remarkable capacity for in-context learning. They can construct new predictors from sequences of labeled examples $(x, f(x))$ presented in the input without further parameter updates. We investigate the hypothesis that transformer-based in-context learners implement standard learning algorithms implicitly, by encoding smaller models in their activations, and updating these implicit models as new examples appear in the context. Using linear regression as a prototypical problem, we offer three sources of evidence for this hypothesis. First, we prove by construction that transformers can implement learning algorithms for linear models based on gradient descent and closed-form ridge regression. Second, we show that trained in-context learners closely match the predictors computed by gradient descent, ridge regression, and exact least-squares regression, transitioning between different predictors as transformer depth and dataset noise vary, and converging to Bayesian estimators for large widths and depths. Third, we present preliminary evidence that in-context learners share algorithmic features with these predictors: learners' late layers non-linearly encode weight vectors and moment matrices. These results suggest that in-context learning is understandable in algorithmic terms, and that (at least in the linear case) learners may rediscover standard estimation algorithms. Code and reference implementations are released at \href{https://github.com/ekinakyurek/google-research/blob/master/incontext}{this link}.
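As a minimal illustration (a sketch, not the paper's released code), the snippet below computes the two reference predictors named above, closed-form ridge regression and full-batch gradient descent on a linear model, from a set of in-context examples $(x_i, f(x_i))$ and evaluates them on a query input. The function names, regularization strength, and learning-rate/step settings are illustrative assumptions.

```python
# Sketch (assumed helper names, not the paper's code) of the two reference
# predictors: closed-form ridge regression and full-batch gradient descent
# on a linear model, both fit to in-context examples (x_i, f(x_i)).
import jax
import jax.numpy as jnp


def ridge_predictor(X, y, x_query, lam=0.1):
    """Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    w = jnp.linalg.solve(X.T @ X + lam * jnp.eye(d), X.T @ y)
    return x_query @ w


def gd_predictor(X, y, x_query, lr=0.05, steps=200):
    """Linear model trained by full-batch gradient descent on squared error."""
    w = jnp.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / X.shape[0]  # gradient of half the mean squared error
        w = w - lr * grad
    return x_query @ w


# Illustrative usage: a noiseless linear task with 16 in-context examples.
key_x, key_w, key_q = jax.random.split(jax.random.PRNGKey(0), 3)
X = jax.random.normal(key_x, (16, 8))
w_true = jax.random.normal(key_w, (8,))
y = X @ w_true
x_query = jax.random.normal(key_q, (8,))
print(ridge_predictor(X, y, x_query), gd_predictor(X, y, x_query))
```

In the experiments the abstract summarizes, predictors of this kind serve as baselines against which the transformer's in-context predictions are compared as depth and dataset noise vary.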