Gradually growing the depth of Transformers during training not only reduces training cost but also improves reasoning performance, as shown by MIDAS (Saunshi et al., 2024). Thus far, however, a mechanistic understanding of these gains has been missing. We connect them to recent findings that layers in the second half of standard (non-grown), pre-layernorm Transformers contribute far less to the final output distribution than those in the first half, a phenomenon known as the Curse of Depth (Sun et al., 2025; Csordás et al., 2025). Using depth-wise analyses, we demonstrate that growth via gradual middle stacking yields more effective utilization of model depth, alters the structure of the residual stream, and facilitates the formation of permutable computational blocks. In addition, we propose a lightweight modification of MIDAS that yields further improvements on downstream reasoning benchmarks. Overall, this work highlights how gradually growing model depth can lead to the formation of distinct computational circuits and overcome the limited depth utilization seen in standard non-grown models.
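To make the growth operation concrete, below is a minimal sketch of depth growth via middle stacking, assuming a MIDAS-style schedule in which a copy of the central block of layers is re-inserted into the middle of the network at each growth step. The function name `grow_middle`, the parameter `n_copy`, and the layer choice are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of one growth step via middle stacking (not the
# reference MIDAS implementation): the central n_copy layers are deep-copied
# and re-inserted in the middle, increasing depth while reusing trained weights.
import copy
import torch.nn as nn


def grow_middle(layers: nn.ModuleList, n_copy: int) -> nn.ModuleList:
    """Duplicate the central n_copy layers and re-insert them in the middle."""
    mid = len(layers) // 2
    start = mid - n_copy // 2
    end = start + n_copy
    middle_copy = [copy.deepcopy(layer) for layer in layers[start:end]]
    new_layers = list(layers[:end]) + middle_copy + list(layers[end:])
    return nn.ModuleList(new_layers)


# Usage: grow an 8-layer stack to 12 layers at a scheduled point in training.
layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=256, nhead=4) for _ in range(8)]
)
layers = grow_middle(layers, n_copy=4)
assert len(layers) == 12
```

Applied repeatedly over training, such a schedule grows a shallow model into the target depth, which is the setting the depth-wise analyses above examine.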