Reinforcement Learning with Verifiable Rewards (RLVR) replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to $\{0,1\}$, but imperfect verifiers inevitably introduce \emph{false negatives} (rejecting correct answers) and \emph{false positives} (accepting incorrect ones). We formalize verifier unreliability as a stochastic reward channel with asymmetric noise rates $\rho_0$ and $\rho_1$, the false-positive (FP) and false-negative (FN) rates, respectively. From this abstraction we derive two lightweight corrections: (i) a \emph{backward} correction that yields an unbiased surrogate reward, and hence a policy-gradient estimator that is unbiased in expectation, and (ii) a \emph{forward} correction that reweights score-function terms so the expected update aligns with the clean gradient direction and requires only the FN rate. We implement both corrections as simple hooks in a group relative policy optimization (GRPO) pipeline; under both synthetic and real verifier noise they improve RLVR for math reasoning, with the forward variant remaining more stable under heavier noise. Finally, an appeals mechanism with a lightweight LLM verifier estimates the FN rate online and further improves performance.
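
As a concrete illustration (not the paper's released code), the backward correction can be obtained by solving $\mathbb{E}[\tilde{r}(\hat{r}) \mid r] = r$ for the two-parameter channel above, which gives the standard unbiased surrogate $\tilde{r}(\hat{r}) = (\hat{r} - \rho_0)/(1 - \rho_0 - \rho_1)$, valid when $\rho_0 + \rho_1 < 1$. The sketch below shows how such a surrogate could be dropped into a GRPO-style group-normalized advantage; the function names and noise rates are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def backward_corrected_reward(r_hat: np.ndarray, rho_0: float, rho_1: float) -> np.ndarray:
    """Unbiased surrogate for a {0,1} reward observed through an asymmetric noise channel:
    P(r_hat=1 | r=0) = rho_0 (FP rate), P(r_hat=0 | r=1) = rho_1 (FN rate).
    Solving E[r_tilde(r_hat) | r] = r gives r_tilde = (r_hat - rho_0) / (1 - rho_0 - rho_1),
    which is well defined only when rho_0 + rho_1 < 1.
    """
    denom = 1.0 - rho_0 - rho_1
    assert denom > 0.0, "backward correction requires rho_0 + rho_1 < 1"
    return (r_hat - rho_0) / denom


def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO-style advantage: standardize rewards within a group of rollouts for one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rho_0, rho_1 = 0.05, 0.20                              # assumed verifier FP / FN rates
    true_r = rng.integers(0, 2, size=8).astype(float)      # unobserved true correctness
    flip = np.where(true_r == 1, rng.random(8) < rho_1, rng.random(8) < rho_0)
    noisy_r = np.where(flip, 1.0 - true_r, true_r)         # what the noisy verifier reports

    corrected = backward_corrected_reward(noisy_r, rho_0, rho_1)
    advantages = group_relative_advantages(corrected)
    print("noisy rewards :", noisy_r)
    print("surrogate     :", np.round(corrected, 3))
    print("advantages    :", np.round(advantages, 3))
```

In this sketch the correction only rescales and shifts the observed binary reward before advantage normalization, which is why it can be wired in as a small hook rather than a change to the optimizer itself; the forward variant described in the abstract instead reweights the score-function terms and is not reproduced here.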


