Softmax attention is a central component of transformer architectures, yet its nonlinear structure poses significant challenges for theoretical analysis. We develop a unified, measure-based framework for studying single-layer softmax attention under both finite and infinite prompts. For i.i.d. Gaussian inputs, we exploit the fact that the softmax operator converges in the infinite-prompt limit to a linear operator acting on the underlying input-token measure. Building on this insight, we establish non-asymptotic concentration bounds for the output and gradient of softmax attention, quantifying how rapidly the finite-prompt model approaches its infinite-prompt counterpart, and prove that this concentration remains stable along the entire training trajectory in general in-context learning settings with sub-Gaussian tokens. In the case of in-context linear regression, we use the tractable infinite-prompt dynamics to analyze training at finite prompt length. Our results allow optimization analyses developed for linear attention to transfer directly to softmax attention when prompts are sufficiently long, showing that large-prompt softmax attention inherits the analytical structure of its linear counterpart. This, in turn, provides a principled and broadly applicable toolkit for studying the training dynamics and statistical behavior of softmax attention layers in the large-prompt regime.
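To make the limiting object concrete, the following is a minimal sketch of the infinite-prompt heuristic under a standard single-layer parameterization; the matrices $W_Q, W_K, W_V$ and the token measure $\mu$ are notation introduced here for illustration and may differ from the paper's exact setup.

% Finite-prompt softmax attention applied to a query x_q over i.i.d. tokens x_1, ..., x_N drawn from mu:
\[
\mathrm{Attn}_N(x_q)
  \;=\; \sum_{i=1}^{N}
        \frac{\exp\!\big(\langle W_Q x_q,\, W_K x_i\rangle\big)}
             {\sum_{j=1}^{N}\exp\!\big(\langle W_Q x_q,\, W_K x_j\rangle\big)}
        \, W_V x_i .
\]
% Dividing numerator and denominator by N, both empirical averages concentrate
% around their population counterparts as N grows, suggesting the limit
\[
\mathrm{Attn}_\infty(x_q)
  \;=\; \frac{\mathbb{E}_{x\sim\mu}\!\big[\exp\!\big(\langle W_Q x_q,\, W_K x\rangle\big)\, W_V x\big]}
             {\mathbb{E}_{x\sim\mu}\!\big[\exp\!\big(\langle W_Q x_q,\, W_K x\rangle\big)\big]},
\]
% an operator defined entirely through integrals against the input-token measure mu.

The non-asymptotic bounds described in the abstract can be read as quantifying the rate at which $\mathrm{Attn}_N$ and its gradient approach this population-level object.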