Reinforcement learning (RL) has emerged as an effective paradigm for enhancing model reasoning. However, existing RL methods such as GRPO typically rely on unstructured self-sampling to fit scalar rewards, producing inefficient rollouts that fail to capture transferable problem-solving strategies. To address these limitations, we propose **TemplateRL**, a structured template-guided RL framework that augments policy optimization with explicit template guidance. Our approach first constructs a problem-solving template library via MCTS on a small seed set, then seamlessly integrates this high-level structured guidance into RL training. By guiding rollout generation to align with validated template structures, TemplateRL significantly improves the hit rate of high-quality trajectories while reducing ineffective exploration. This structure-guided design steers the policy toward proven strategic patterns, stabilizing training dynamics and improving RL sampling efficiency. Notably, the explicit template library is interpretable, editable, and supports online updates, enabling continuous refinement during both training and inference. Extensive experiments demonstrate that TemplateRL outperforms GRPO by 99% on AIME and 41% on AMC, with superior stability on weak models and strong cross-domain generalization, highlighting its potential for broader tasks.
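To make the template-guided rollout step concrete, the sketch below shows one plausible way a retrieved template could be prepended to the sampling prompt so that rollouts follow its structure. The `Template`, `TemplateLibrary`, and `build_guided_prompt` names and the keyword-overlap retrieval are illustrative assumptions for this sketch, not the paper's actual implementation.

```python
# Minimal sketch of template-guided rollout prompting (assumed interface,
# not TemplateRL's implementation): a plain-text template library with a
# simple keyword-overlap retriever that prefixes the sampling prompt.

from dataclasses import dataclass, field


@dataclass
class Template:
    """A high-level problem-solving strategy distilled from MCTS on a seed set."""
    name: str
    steps: list[str]
    keywords: set[str] = field(default_factory=set)


@dataclass
class TemplateLibrary:
    templates: list[Template] = field(default_factory=list)

    def add(self, template: Template) -> None:
        # Online update: templates can be inserted or edited at any time.
        self.templates.append(template)

    def retrieve(self, problem: str, top_k: int = 1) -> list[Template]:
        # Rank templates by keyword overlap with the problem statement.
        tokens = set(problem.lower().split())
        scored = sorted(
            self.templates,
            key=lambda t: len(t.keywords & tokens),
            reverse=True,
        )
        return scored[:top_k]


def build_guided_prompt(problem: str, library: TemplateLibrary) -> str:
    """Prepend the best-matching template so rollouts align with its structure."""
    guidance = ""
    for template in library.retrieve(problem):
        steps = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(template.steps))
        guidance += f"Strategy ({template.name}):\n{steps}\n\n"
    return f"{guidance}Problem: {problem}\nSolve by following the strategy above."


if __name__ == "__main__":
    library = TemplateLibrary()
    library.add(Template(
        name="casework",
        steps=["Split the problem into exhaustive cases.",
               "Solve each case independently.",
               "Combine the case results."],
        keywords={"how", "many", "integers", "count"},
    ))
    print(build_guided_prompt("How many integers n satisfy ...", library))
```

The guided prompt would then be passed to the policy during GRPO-style group sampling, with the reward computed on the resulting trajectories as usual.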