Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models. However, existing methods rely solely on outcome rewards, without explicitly optimizing verification or leveraging reliable signals from realistic environments, leading to unreliable self-verification and limited test-time scaling. To address this, we widen the verification-generation asymmetry by explicitly optimizing self-verification, making it a reliable driver of deeper test-time scaling. We introduce ReVeal, a multi-turn reinforcement learning framework that evolves code generation through self-verification and tool-based evaluation. ReVeal structures long-horizon reasoning as iterative generation-verification turns and incorporates TAPO for turn-level credit assignment, fostering the co-evolution of code and test generation. At inference time, this strengthened self-verification enables the model to use self-constructed tests and tool feedback to continuously evolve its code for more than 20 turns on LiveCodeBench, despite being trained with only three turns. It also significantly improves Pass@k, indicating stronger exploration that expands the reasoning boundaries of the base model. These findings highlight the promise of ReVeal as a scalable paradigm for RL training and test-time scaling, paving the way for more robust and autonomous AI agents.
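To make the inference-time loop concrete, the following is a minimal sketch (not the authors' implementation) of the iterative generation-verification process described above. The helpers generate_code, generate_tests, and run_in_sandbox are hypothetical stand-ins for the policy model's generation calls and the tool-based evaluator.

```python
def reveal_inference(problem, max_turns=20):
    """Iteratively evolve a candidate solution via self-verification and tool feedback.

    A hedged sketch of the generation-verification loop; the helper functions
    below are assumed placeholders, not part of ReVeal's released code.
    """
    code, feedback = None, None
    for turn in range(max_turns):
        # Generation turn: propose or revise code, conditioned on prior tool feedback.
        code = generate_code(problem, prev_code=code, feedback=feedback)
        # Verification turn: the model constructs its own tests for the problem.
        tests = generate_tests(problem, code)
        # Tool-based evaluation: execute the candidate against the self-constructed tests.
        feedback = run_in_sandbox(code, tests)
        if feedback.all_passed:
            break  # self-verification reports success; stop early
    return code
```

In this sketch, the loop simply caps the number of turns; the abstract's point is that after RL training with TAPO, the same loop remains productive well beyond the three turns seen during training.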