Stable and Efficient Single-Rollout RL for Multimodal Reasoning

Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm for improving the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require sampling multiple rollouts for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this trade-off between training efficiency and stability, we introduce $\textbf{MSSR}$ (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR does so via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR is more compute-efficient in training, reaching validation accuracy similar to the group-based baseline in half the training steps. When trained for the same number of steps, MSSR surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, compute-efficient, and effective RLVR for complex multimodal reasoning tasks.
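To make the contrast between group-based and single-rollout advantages concrete, the following is a minimal sketch rather than the MSSR implementation. The abstract does not give the exact formulas, so everything here is an assumption for illustration: the moving-average baseline (`RunningBaseline`), the clipped entropy bonus in `shaped_token_advantages`, and the parameters `momentum`, `alpha`, and `kappa` are all hypothetical names and choices.

```python
import math
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: normalize each reward by the mean and
    standard deviation of several rollouts sampled for the SAME prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class RunningBaseline:
    """Hypothetical group-free baseline: an exponential moving average of
    past rewards, so only one rollout per prompt is needed."""

    def __init__(self, momentum=0.9, init=0.0):
        self.momentum = momentum
        self.value = init

    def update(self, reward):
        self.value = self.momentum * self.value + (1.0 - self.momentum) * reward
        return self.value


def shaped_token_advantages(reward, baseline, entropies, alpha=0.4, kappa=2.0):
    """Assumed shaping rule (illustrative only): add a per-token entropy
    bonus to the single-rollout advantage, capped at |advantage| / kappa.
    With kappa > 1 the bonus can never flip the sign of the advantage."""
    base = reward - baseline
    cap = abs(base) / kappa
    return [base + min(alpha * h, cap) for h in entropies]


# Group-based: four rollouts of one prompt with binary verifiable rewards.
group = grpo_advantages([1.0, 0.0, 1.0, 0.0])

# Single-rollout: one reward, a running baseline, and per-token entropies.
entropies = [token_entropy([1.0]), token_entropy([0.5, 0.5]), 5.0]
shaped = shaped_token_advantages(1.0, 0.5, entropies)
```

In this sketch the cap keeps the entropy term bounded relative to the underlying advantage, which is one simple way to read "adaptively regularizes advantage magnitudes"; the actual mechanism in MSSR may differ.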
