Pretrained Transformers excel at in-context learning (ICL), inferring new tasks from only a handful of examples. Yet their ICL performance can degrade sharply under distribution shift between pretraining and test data, a regime increasingly common in real-world deployments. While recent empirical work hints that adjusting the attention temperature in the softmax can enhance Transformer performance, the role of attention temperature in ICL under distribution shift remains unexplored. This paper provides the first theoretical and empirical study of attention temperature for ICL under distribution shift. Using a simplified but expressive "linearized softmax" framework, we derive closed-form expressions for the generalization error and prove that shifts in input covariance or label noise substantially impair ICL, yet an optimal attention temperature exists that minimizes this error. We then validate our predictions through extensive simulations on linear regression tasks and large-scale experiments with GPT-2 and LLaMA2-7B on question-answering benchmarks. Our results establish attention temperature as a principled and powerful mechanism for improving the robustness of ICL in pretrained Transformers, advancing theoretical understanding and offering actionable guidance for selecting attention temperature in practice.
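To make the mechanism referred to above concrete, the sketch below scales the softmax logits by a temperature τ, so that τ = 1 recovers standard scaled dot-product attention, τ > 1 flattens the attention weights over the in-context examples, and τ < 1 sharpens them. This is a minimal NumPy sketch under simplifying assumptions (identity query/key/value maps and a toy linear-regression prompt), not the paper's trained linearized-softmax model; names such as `attention_with_temperature` and `tau` are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_temperature(Q, K, V, tau=1.0):
    """Single-head attention with logits scaled by a temperature tau.

    tau = 1.0 recovers standard softmax(Q K^T / sqrt(d)) V.
    Identity projections are used purely for illustration -- an assumption,
    not the model analyzed in the paper.
    """
    d = Q.shape[-1]
    weights = softmax(Q @ K.T / (tau * np.sqrt(d)), axis=-1)
    return weights @ V, weights

# Toy in-context linear-regression prompt: n labelled examples plus one query.
rng = np.random.default_rng(0)
n, d = 16, 4
w_star = rng.normal(size=d)                    # ground-truth regression vector
X = rng.normal(size=(n, d))
y = X @ w_star + 0.1 * rng.normal(size=n)      # noisy labels
x_query = rng.normal(size=d)

# Each token stacks an input with its label; the query's label slot is zero.
tokens = np.vstack([np.hstack([X, y[:, None]]),
                    np.hstack([x_query, 0.0])])

# How the temperature reshapes the query's attention over the context examples.
for tau in (0.5, 1.0, 2.0):
    _, weights = attention_with_temperature(tokens, tokens, tokens, tau=tau)
    w_ctx = weights[-1, :-1]          # query row, restricted to context tokens
    w_ctx = w_ctx / w_ctx.sum()       # renormalize over the context examples
    entropy = -(w_ctx * np.log(w_ctx)).sum()
    print(f"tau={tau}: max context weight {w_ctx.max():.3f}, entropy {entropy:.3f}")
```

Raising τ spreads the query's attention more evenly across the context examples, while lowering it concentrates attention on the closest examples; this is the single scalar knob whose effect under covariance and label-noise shift the paper analyzes.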