Know What You Don't Know: Uncertainty Calibration of Process Reward Models

Process reward models (PRMs) play a central role in guiding inference-time scaling algorithms for large language models (LLMs). However, we observe that even state-of-the-art PRMs can be poorly calibrated. Specifically, they tend to overestimate the probability that a partial reasoning trajectory will lead to a correct final answer, particularly when smaller LLMs are used to complete the trajectory. To address this, we present a calibration approach, performed via quantile regression, that adjusts PRM outputs to better align with true success probabilities. Leveraging these calibrated success estimates and their associated confidence bounds, we introduce an instance-adaptive scaling (IAS) framework that dynamically adjusts the compute budget based on the estimated likelihood that a partial reasoning trajectory will yield a correct final answer. Unlike conventional methods that allocate a fixed number of reasoning trajectories per query, IAS, when paired with our calibrated PRMs, adapts the budget to each instance and reasoning step. Experiments on mathematical reasoning benchmarks show that (i) our PRM calibration method achieves small calibration error, outperforming the baseline methods, (ii) calibration is crucial for enabling effective IAS, and (iii) the proposed IAS strategy reduces inference costs while maintaining final answer accuracy, using less compute on problems where the model is more confident.
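To make the two ingredients concrete, the following is a minimal sketch and not the method as implemented in the paper. It assumes (a) that quantile regression is trained with the standard pinball loss, and (b) that a calibrated lower bound `p_lower` on the per-trajectory success probability is turned into a sample budget by asking for at least one success with probability `target`, treating samples as independent. The names `pinball_loss`, `adaptive_num_samples`, `p_lower`, `target`, and `n_max` are hypothetical and chosen for illustration only.

```python
import math


def pinball_loss(y_true, y_pred, q):
    """Standard pinball (quantile) loss for a quantile level q in (0, 1).

    For a small q, predictions above the target are penalized more than
    predictions below it, which pushes the fit toward a conservative
    (lower) estimate of the success probability.
    """
    diff = y_true - y_pred
    return max(q * diff, (q - 1.0) * diff)


def adaptive_num_samples(p_lower, target=0.95, n_max=64):
    """Smallest N with 1 - (1 - p_lower)**N >= target, capped at n_max.

    Assumption: each sampled trajectory succeeds independently with
    probability at least p_lower (a calibrated lower bound).
    """
    if p_lower <= 0.0:
        return n_max  # no evidence of success: spend the full budget
    if p_lower >= 1.0:
        return 1      # success is certain: a single sample is enough
    n = math.ceil(math.log(1.0 - target) / math.log(1.0 - p_lower))
    return max(1, min(n, n_max))


if __name__ == "__main__":
    # Confident instances get few samples, uncertain ones get many.
    for p in (0.9, 0.5, 0.05, 0.01):
        print(p, adaptive_num_samples(p))
```

Under these assumptions the budget shrinks quickly as the calibrated bound rises, which is the qualitative behavior the abstract describes. An overconfident (uncalibrated) `p_lower` would lead this rule to under-sample, which is one way to see why calibration matters for adaptive budgeting.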
