无思维策略初始化使蒸馏推理模型成为更高效能效的推理器 (Thinking-Free Policy Initialization Makes Distilled Reasoning Models More Effective and Efficient Reasoners)

Reinforcement Learning with Verifiable Reward (RLVR) effectively solves complex tasks but demands extremely long context lengths during training, leading to substantial computational costs. While multi-stage training can partially mitigate this, starting with overly short contexts often causes irreversible performance degradation, ultimately failing to reduce overall training compute significantly. In this paper, we introduce **T**hinking-**F**ree **P**olicy **I**nitialization (**TFPI**), a simple yet effective adaptation to RLVR that bridges long Chain-of-Thought (CoT) distillation and standard RLVR. TFPI employs a simple *ThinkFree* operation, explicitly discarding the thinking content via a direct *</think>* append, to reduce token usage during inference. Training with *ThinkFree*-adapted inputs improves performance and lowers token consumption, even in the original slow-thinking mode. Extensive experiments across various benchmarks have shown that TFPI accelerates RL convergence, achieves a higher performance ceiling, and yields more token-efficient reasoning models without specialized rewards or complex training designs. With TFPI only, we train a 4B model to reach 89.0% accuracy on AIME24 and 65.5% on LiveCodeBench using less than 4K H20 hours.

翻译：可验证奖励强化学习（RLVR）能有效解决复杂任务，但在训练期间需要极长的上下文长度，导致巨大的计算成本。虽然多阶段训练可以部分缓解这一问题，但从过短的上下文开始通常会导致不可逆的性能下降，最终无法显著降低整体训练计算量。本文提出**无思维策略初始化（TFPI）**，这是一种对RLVR的简单而有效的适配方法，它桥接了长思维链（CoT）蒸馏与标准RLVR。TFPI采用一种简单的*无思维*操作，通过直接附加*</think>*标记来显式丢弃思维内容，从而减少推理过程中的令牌使用。使用*无思维*适配的输入进行训练，即使在原始的慢思维模式下，也能提高性能并降低令牌消耗。在多个基准测试上的广泛实验表明，TFPI加速了RL收敛，达到了更高的性能上限，并产生了更具令牌效率的推理模型，而无需专门的奖励或复杂的训练设计。仅使用TFPI，我们训练了一个4B模型，在AIME24上达到89.0%的准确率，在LiveCodeBench上达到65.5%的准确率，且使用了不到4K H20小时的计算时间。