Although reinforcement learning with verifiable rewards (RLVR) has become an essential component for developing advanced reasoning skills in LLMs, contemporary studies have documented training plateaus that emerge after thousands of optimization steps, with markedly diminishing performance gains despite increased computational investment. This limitation stems from the sparse exploration patterns inherent in current RLVR practice, where models rely on limited rollouts that often miss critical reasoning paths and fail to cover the solution space systematically. We present DeepSearch, a framework that integrates Monte Carlo Tree Search directly into RLVR training. In contrast to existing methods that rely on tree search only at inference, DeepSearch embeds structured search into the training loop, enabling systematic exploration and fine-grained credit assignment across reasoning steps. Through training-time exploration, DeepSearch addresses the fundamental bottleneck of insufficient exploration, which causes performance improvements to diminish over prolonged training. Our contributions include: (1) a global frontier selection strategy that prioritizes promising nodes across the search tree, (2) entropy-based guidance during selection that identifies confident paths for supervision, and (3) adaptive replay buffer training with solution caching for efficiency. Experiments on mathematical reasoning benchmarks show that DeepSearch achieves 62.95% average accuracy and establishes a new state of the art for 1.5B reasoning models, while using 5.7x fewer GPU hours than extended training approaches. These results highlight the importance of strategic exploration over brute-force scaling and demonstrate the promise of algorithmic innovation for advancing RLVR methodologies. DeepSearch establishes a new direction for scaling reasoning capabilities through systematic search rather than prolonged computation.
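To make the idea of training-time tree search concrete, the toy sketch below mirrors, in highly simplified form, the three components named in the abstract: selecting the most promising leaf across the whole tree (a stand-in for global frontier selection), breaking ties toward low-entropy, confident paths, and caching verified solution traces for a later RL update. Every name here (`Node`, `frontier_select`, `toy_policy`, `verify`) and the specific scoring rule are illustrative assumptions, not DeepSearch's actual algorithm.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    state: str                      # partial reasoning trace so far
    parent: "Node | None" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0              # accumulated verifiable reward
    entropy: float = 1.0            # policy entropy at this step (stubbed)

def ucb_score(node, c=1.4):
    # Standard UCT-style score; unvisited nodes (and the root) get priority.
    if node.parent is None or node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits + 1) / node.visits)

def expand(node, policy):
    # Ask the (stub) policy for candidate next steps and their entropies.
    for step, ent in policy(node.state):
        node.children.append(Node(state=node.state + step, parent=node, entropy=ent))

def verify(state):
    # Stub verifiable reward: 1.0 if the trace reaches the answer marker.
    return 1.0 if "ANSWER" in state else 0.0

def frontier_select(root):
    # Pick the best-scoring leaf anywhere in the tree, preferring confident
    # (low-entropy) nodes as a tie-breaker.
    frontier, stack = [], [root]
    while stack:
        n = stack.pop()
        if not n.children:
            frontier.append(n)
        stack.extend(n.children)
    return max(frontier, key=lambda n: (ucb_score(n), -n.entropy))

def search(question, policy, budget=16):
    root, solutions = Node(state=question), []
    for _ in range(budget):
        leaf = frontier_select(root)
        expand(leaf, policy)
        for child in leaf.children:
            r = verify(child.state)
            # Backpropagate the verifiable reward along the path to the root.
            n = child
            while n is not None:
                n.visits += 1
                n.value += r
                n = n.parent
            if r > 0:
                solutions.append(child.state)   # cache verified traces
    return solutions

# Toy stub policy: proposes two continuations with made-up entropies.
def toy_policy(state):
    return [(" step", 0.8), (" ANSWER", 0.2)]

if __name__ == "__main__":
    buffer = search("Q: 2+2=?", toy_policy, budget=8)
    print(f"cached {len(buffer)} verified traces for the RL update")
```

In a real system the stub policy and verifier would be replaced by the LLM's step proposals and the verifiable-reward checker, and the cached traces would populate the replay buffer that drives the RLVR policy update.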