Nemotron-Cascade: Scaling Cascaded Reinforcement Learning for General-Purpose Reasoning Models

From arXiv. We publicly release the Nemotron-Cascade models and the full collection of training data at: https://huggingface.co/collections/nvidia/nemotron-cascade

Building general-purpose reasoning models with reinforcement learning (RL) entails substantial cross-domain heterogeneity, including large variation in inference-time response lengths and verification latency. Such variability complicates the RL infrastructure, slows training, and makes the training curriculum (e.g., response length extension) and hyperparameter selection challenging. In this work, we propose cascaded domain-wise reinforcement learning (Cascade RL) to develop general-purpose reasoning models, Nemotron-Cascade, capable of operating in both instruct and deep thinking modes. Departing from conventional approaches that blend heterogeneous prompts from different domains, Cascade RL orchestrates sequential, domain-wise RL, reducing engineering complexity and delivering state-of-the-art performance across a wide range of benchmarks. Notably, reinforcement learning from human feedback (RLHF) for alignment, when used as a pre-step, boosts the model's reasoning ability far beyond mere preference optimization, and the subsequent domain-wise stages of RL with verifiable rewards (RLVR) rarely degrade the benchmark performance attained in earlier domains and may even improve it (see an illustration in Figure 1). Our 14B model, after RL, outperforms its SFT teacher, DeepSeek-R1-0528, on LiveCodeBench v5/v6/Pro and achieves silver-medal performance in the 2025 International Olympiad in Informatics (IOI). We transparently share our training and data recipes.
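The difference between blending domains and cascading them can be pictured as a scheduling choice. The toy Python sketch below is an illustration under stated assumptions, not the authors' training code: the `Stage` fields, the stage names, the step counts, and the token budgets are all hypothetical, and `toy_train_step` merely counts updates in place of a real RL update. It shows only the structural idea that each domain gets its own stage, with its own settings, run one after another on the same policy.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Toy stand-in for a policy: number of updates received per domain.
Policy = Dict[str, int]


@dataclass(frozen=True)
class Stage:
    """One domain-wise RL stage (all fields are illustrative assumptions)."""
    name: str
    steps: int
    max_response_tokens: int  # per-stage setting, e.g. a response length budget


def toy_train_step(policy: Policy, stage: Stage) -> Policy:
    """Placeholder for a real RL update: just records one update for the domain."""
    updated = dict(policy)
    updated[stage.name] = updated.get(stage.name, 0) + 1
    return updated


def run_cascade(
    policy: Policy,
    stages: List[Stage],
    train_step: Callable[[Policy, Stage], Policy] = toy_train_step,
) -> Tuple[Policy, List[str]]:
    """Run the stages sequentially, each on the policy left by the previous one."""
    schedule: List[str] = []
    for stage in stages:
        for _ in range(stage.steps):
            policy = train_step(policy, stage)
            schedule.append(stage.name)
    return policy, schedule


# Hypothetical stage list: an alignment pre-step followed by two
# verifiable-reward domains. Names and numbers are made up for illustration.
EXAMPLE_STAGES = [
    Stage(name="rlhf", steps=2, max_response_tokens=4096),
    Stage(name="math", steps=3, max_response_tokens=16384),
    Stage(name="code", steps=2, max_response_tokens=32768),
]

if __name__ == "__main__":
    final_policy, schedule = run_cascade({}, EXAMPLE_STAGES)
    print(schedule)
    print(final_policy)
```

Because every stage draws from a single domain, settings such as the response length budget can be chosen per stage instead of once for a heterogeneous mixture, which is the engineering simplification the abstract points to.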
