SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on human-curated problem-answer pairs and domain-specific reward engineering. We introduce SPIRAL, a self-play framework in which models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision. Through self-play, SPIRAL generates an infinite curriculum of progressively challenging problems, because models must constantly adapt to stronger opponents. To enable this self-play training at scale, we implement a fully online, multi-turn, multi-agent reinforcement learning system for LLMs and propose role-conditioned advantage estimation (RAE) to stabilize multi-agent training. With SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Training Qwen3-4B-Base on Kuhn Poker alone achieves an 8.6% improvement on math and an 8.4% improvement on general reasoning, outperforming SFT on 25,000 expert game trajectories. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis. Multi-game training (TicTacToe, Kuhn Poker, Simple Negotiation) further enhances performance, as each game develops distinct reasoning strengths. Applying SPIRAL to a strong reasoning model (DeepSeek-R1-Distill-Qwen-7B) still yields a 2.0% average improvement. These results demonstrate that zero-sum games naturally develop transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.
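
The abstract names role-conditioned advantage estimation (RAE) but does not give its formula. The following is a minimal sketch under one plausible reading: each (game, role) pair keeps its own running-average return as a baseline, and an episode's advantage is its return minus that baseline. The class name `RoleBaselines`, the `decay` parameter, and the exponential-moving-average update are assumptions made for illustration, not details taken from the text above.

```python
from collections import defaultdict


class RoleBaselines:
    """Hypothetical per-(game, role) baseline tracker for advantage estimation."""

    def __init__(self, decay: float = 0.95):
        # decay is an assumed smoothing factor for the moving average.
        self.decay = decay
        # Every (game, role) baseline starts at 0.0.
        self.baselines = defaultdict(float)

    def advantage(self, game: str, role: int, episode_return: float) -> float:
        """Update the baseline for (game, role) and return return - baseline."""
        key = (game, role)
        updated = self.decay * self.baselines[key] + (1.0 - self.decay) * episode_return
        self.baselines[key] = updated
        return episode_return - updated


# Usage: in a zero-sum game the two roles receive opposite returns,
# and each role is compared only against its own baseline.
tracker = RoleBaselines(decay=0.5)
adv_player0 = tracker.advantage("KuhnPoker", 0, 1.0)
adv_player1 = tracker.advantage("KuhnPoker", 1, -1.0)
```

Keeping the baselines separate by role means that a role with a structural edge (for example, moving first) is not credited for that edge: only returns above that role's own average produce a positive advantage.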
