While looped transformers (termed Looped-Attn) often outperform standard transformers (termed Single-Attn) on complex reasoning tasks, the theoretical basis for this advantage remains underexplored. In this paper, we explain this phenomenon through the lens of loss landscape geometry, motivated by empirical observations of their distinct training dynamics at both the sample and Hessian levels. To formalize this, we extend the River-Valley landscape model by distinguishing between U-shaped valleys (flat) and V-shaped valleys (steep). Based on empirical observations, we conjecture that the recursive architecture of Looped-Attn induces a landscape-level inductive bias towards a River-V-Valley landscape. Theoretical derivations based on this inductive bias guarantee better loss convergence along the river through valley hopping, and further encourage the learning of complex patterns, compared with the River-U-Valley landscape induced by Single-Attn. Building on this insight, we propose SHIFT (Staged HIerarchical Framework for Progressive Training), a staged training framework that accelerates the training of Looped-Attn while achieving comparable performance.
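To make the U-shaped versus V-shaped distinction concrete, a minimal toy landscape can be sketched as follows; the coordinates $u$ (river direction) and $v$ (valley direction), the river profile $r(u)$, and the constant $c>0$ are illustrative placeholders rather than notation from the paper:
\[
L_{\mathrm{U}}(u,v) \;=\; r(u) + c\,v^{2},
\qquad
L_{\mathrm{V}}(u,v) \;=\; r(u) + c\,\lvert v\rvert .
\]
In this toy picture, the cross-valley gradient of the U-shaped valley vanishes near the floor, $\lvert \partial_v L_{\mathrm{U}} \rvert = 2c\lvert v\rvert \to 0$, so descent stalls in the flat region, whereas the V-shaped valley keeps $\lvert \partial_v L_{\mathrm{V}} \rvert = c$ bounded away from zero, so a fixed-step gradient update repeatedly overshoots the floor (valley hopping) while the $r(u)$ term continues to drive progress along the river.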