One of the most striking features of Large Language Models (LLMs) is their ability to learn in-context. Namely, at inference time, an LLM is able to learn new patterns without any weight update when these patterns are presented as examples in the prompt, even if these patterns were not seen during training. The mechanisms through which this can happen are still largely unknown. In this work, we show that stacking a self-attention layer with an MLP allows the transformer block to implicitly modify the weights of the MLP layer according to the context. We argue through theory and experimentation that this simple mechanism may be the reason why LLMs can learn in-context and not only during training. Specifically, we show how a transformer block implicitly transforms a context into a low-rank weight update of its MLP layer.
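To make the claimed mechanism concrete, the following minimal numerical sketch illustrates one way a context can be absorbed into a rank-1 update of the MLP's first weight matrix. The variable names, the stand-in vectors for the attention outputs with and without context, and the specific rank-1 construction are illustrative assumptions for this sketch, not necessarily the exact formulation used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64

# Stand-ins for the self-attention output at the query token,
# with and without the context examples present in the prompt.
a_no_ctx = rng.normal(size=d_model)      # attention output for the query alone
a_with_ctx = rng.normal(size=d_model)    # attention output for context + query
delta_a = a_with_ctx - a_no_ctx          # shift induced by the context

# First linear layer of the MLP followed by a ReLU nonlinearity.
W = rng.normal(size=(d_ff, d_model))
mlp = lambda h: np.maximum(W @ h, 0.0)

# Hypothetical rank-1 weight update built from the context-induced shift:
# (W + delta_W) @ a_no_ctx reproduces W @ a_with_ctx exactly.
delta_W = np.outer(W @ delta_a, a_no_ctx) / np.dot(a_no_ctx, a_no_ctx)

# Feeding the context-aware activation through the original MLP matches
# feeding the context-free activation through the implicitly updated MLP.
lhs = mlp(a_with_ctx)
rhs = np.maximum((W + delta_W) @ a_no_ctx, 0.0)
print(np.allclose(lhs, rhs))              # True
print(np.linalg.matrix_rank(delta_W))     # 1 (a low-rank update)
```

In this toy setting, the effect of the context on the block's output is exactly equivalent to running the context-free input through an MLP whose first-layer weights received a rank-1 correction, which is the kind of implicit low-rank weight update the abstract refers to.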