Although Large Language Models (LLMs) exhibit advanced reasoning abilities, conventional alignment remains largely dominated by outcome reward models (ORMs), which judge only final answers. Process Reward Models (PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs across the full loop: how to generate process data, how to build PRMs, and how to use PRMs for test-time scaling and reinforcement learning. We summarize applications in math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.