Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined, and RL objectives interact with unknown prior knowledge in complex ways. To resolve this ambiguity, we develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training. Our approach employs synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions. We evaluate models along two axes: extrapolative generalization to more complex compositions and contextual generalization across surface contexts. Using this framework, we reconcile competing views on RL's effectiveness. We show that: 1) RL produces true capability gains (measured by pass@128) only when pre-training leaves sufficient headroom and when RL data target the model's edge of competence: tasks at the boundary that are difficult but not yet out of reach. 2) Contextual generalization requires minimal yet sufficient pre-training exposure, after which RL can transfer reliably across contexts. 3) Under fixed compute, mid-training significantly improves performance over RL alone, underscoring its central but underexplored role in training pipelines. 4) Process-level rewards reduce reward hacking and improve reasoning fidelity. Together, these results clarify the interplay between pre-training, mid-training, and RL, and offer a foundation for understanding and improving training strategies for reasoning language models.
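The capability gains in finding 1) are reported via pass@128. As a point of reference only, below is a minimal sketch of the standard unbiased pass@k estimator (Chen et al., 2021); the sample counts in the usage example are illustrative and not drawn from this paper's experiments.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k.

    n: total samples drawn per problem
    c: number of those samples that solve the problem
    k: attempt budget credited to the model (e.g., 128)
    """
    if n - c < k:
        # Fewer than k incorrect samples: at least one of any k draws is correct.
        return 1.0
    # pass@k = 1 - C(n - c, k) / C(n, k), computed as a numerically stable product.
    return 1.0 - math.prod(1.0 - k / i for i in range(n - c + 1, n + 1))

# Illustrative usage: 256 samples per task, 3 correct, evaluated at k = 128.
print(pass_at_k(n=256, c=3, k=128))
```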