Reinforcement learning with verifiable rewards (RLVR) has proven effective for training large reasoning models (LRMs) by leveraging answer-verifiable signals to guide policy optimization; however, it suffers from high annotation costs. To alleviate this problem, recent work has explored unsupervised RLVR methods that derive rewards solely from the model's internal consistency, for example via entropy or majority voting. While seemingly promising, these methods often suffer from model collapse in the later stages of training, which may arise from reinforcing incorrect reasoning patterns in the absence of external supervision. In this work, we investigate a novel semi-supervised RLVR paradigm that uses a small labeled set to guide RLVR training on unlabeled samples. Our key insight is that supervised rewards are essential for stabilizing consistency-based training on unlabeled samples, ensuring that only reasoning patterns verified on labeled instances are incorporated into RL training. Technically, we propose an effective policy optimization algorithm, TraPO, which identifies reliable unlabeled samples by matching their learning-trajectory similarity to that of labeled ones. Building on this, TraPO achieves remarkable data efficiency and strong generalization on six widely used mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). With only 1K labeled and 3K unlabeled samples, TraPO reaches 42.6% average accuracy, surpassing the best unsupervised method trained on 45K unlabeled samples (38.3%). Notably, with 4K labeled and 12K unlabeled samples, TraPO even outperforms the fully supervised model trained on the full 45K labeled samples on all benchmarks, while using less than 10% of the labeled data. The code is available at https://github.com/ShenzhiYang2000/TRAPO.
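The abstract does not specify how learning-trajectory similarity is computed in TraPO. As a rough illustrative sketch only, and not the paper's actual procedure, the snippet below assumes each sample's "learning trajectory" is a vector of per-step training signals (e.g., rollout accuracy or reward recorded across training steps) and selects unlabeled samples whose trajectory is close, under cosine similarity with a threshold, to at least one labeled trajectory. All names (`select_reliable_unlabeled`, `threshold`, the trajectory definition) are hypothetical.

```python
import numpy as np

def select_reliable_unlabeled(labeled_traj: np.ndarray,
                              unlabeled_traj: np.ndarray,
                              threshold: float = 0.9) -> np.ndarray:
    """Return indices of unlabeled samples whose trajectory is close
    (cosine similarity >= threshold) to at least one labeled trajectory.

    labeled_traj:   (n_labeled, T) per-sample training signals over T steps.
    unlabeled_traj: (n_unlabeled, T) the same signal for unlabeled samples,
                    e.g., derived from a consistency-based pseudo-reward.
    NOTE: this is a hypothetical sketch, not TraPO's actual criterion.
    """
    # L2-normalize each trajectory so the dot product equals cosine similarity.
    l = labeled_traj / (np.linalg.norm(labeled_traj, axis=1, keepdims=True) + 1e-8)
    u = unlabeled_traj / (np.linalg.norm(unlabeled_traj, axis=1, keepdims=True) + 1e-8)
    sim = u @ l.T                       # (n_unlabeled, n_labeled) similarity matrix
    best = sim.max(axis=1)              # closest labeled trajectory per unlabeled sample
    return np.where(best >= threshold)[0]

# Toy usage: 4 labeled and 6 unlabeled samples tracked over 5 training steps.
rng = np.random.default_rng(0)
labeled = rng.random((4, 5))
unlabeled = rng.random((6, 5))
print(select_reliable_unlabeled(labeled, unlabeled, threshold=0.8))
```

The selected indices would then be the unlabeled samples admitted into RL training alongside the labeled set; the trajectory definition and similarity measure are design choices that the paper itself specifies.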