We present CRM (Multi-Agent Collaborative Reward Model), a framework that replaces a single black-box reward model with a coordinated team of specialist evaluators to improve robustness and interpretability in RLHF. Conventional reward models struggle to jointly optimize multiple, sometimes conflicting, preference dimensions (e.g., factuality, helpfulness, safety) and offer limited transparency into why a score is assigned. CRM addresses these issues by decomposing preference evaluation across domain-specific agents, each producing a partial signal, complemented by global evaluators such as ranker-based and embedding-similarity rewards. A centralized aggregator fuses these signals at each timestep, balancing factors like step-wise correctness, multi-agent agreement, and repetition penalties, yielding a single training reward compatible with standard RL pipelines. The policy is optimized with advantage-based updates (e.g., GAE), while a value model regresses to the aggregated reward, enabling multi-perspective reward shaping without requiring additional human annotations beyond those used to train the evaluators. To support training and assessment, we introduce rewardBench, a benchmark and training suite aligned with the collaborative structure of CRM. Together, CRM and rewardBench provide a practical, modular path to more transparent reward modeling and more stable optimization.
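To make the aggregation step concrete, the sketch below shows how per-agent partial signals might be fused into a single per-timestep reward and then fed into a standard GAE computation. This is a minimal illustration under stated assumptions, not CRM's actual implementation: the names (`AggregatorConfig`, `aggregate_step_reward`, `gae_advantages`), the mixing weights, and the agreement/repetition heuristics are all hypothetical.

```python
# Illustrative sketch (assumed, not the paper's code): specialist and global
# evaluator scores are fused into one scalar reward per timestep, which then
# feeds a standard GAE computation.
from dataclasses import dataclass, field
from statistics import pvariance
from typing import Dict, List


@dataclass
class AggregatorConfig:
    agreement_weight: float = 0.2    # weight of the inter-agent consensus term
    repetition_weight: float = 0.1   # weight of the repetition penalty
    evaluator_weights: Dict[str, float] = field(default_factory=dict)  # per-evaluator mixing weights


def aggregate_step_reward(step_text: str,
                          specialist_scores: Dict[str, float],
                          global_scores: Dict[str, float],
                          cfg: AggregatorConfig) -> float:
    """Fuse per-agent partial signals into a single reward for this timestep."""
    scores = {**specialist_scores, **global_scores}
    w = {name: cfg.evaluator_weights.get(name, 1.0) for name in scores}

    # Weighted mean of all evaluator signals (a step-wise correctness proxy).
    base = sum(w[n] * s for n, s in scores.items()) / sum(w.values())

    # Agreement term: penalize disagreement via the variance of the scores.
    disagreement = pvariance(list(scores.values())) if len(scores) > 1 else 0.0

    # Repetition penalty: fraction of tokens in this step that are duplicates.
    tokens = step_text.split()
    repetition = 1.0 - len(set(tokens)) / max(len(tokens), 1)

    return base - cfg.agreement_weight * disagreement - cfg.repetition_weight * repetition


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = 0.99, lam: float = 0.95) -> List[float]:
    """Standard GAE over the aggregated rewards; `values` carries one extra
    bootstrap entry, i.e. len(values) == len(rewards) + 1."""
    adv, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv


if __name__ == "__main__":
    cfg = AggregatorConfig(evaluator_weights={"factuality": 2.0})
    r = aggregate_step_reward(
        "The capital of France is Paris.",
        specialist_scores={"factuality": 0.9, "helpfulness": 0.7, "safety": 1.0},
        global_scores={"ranker": 0.6, "embed_sim": 0.75},
        cfg=cfg,
    )
    print(gae_advantages([r, r], values=[0.4, 0.5, 0.0]))
```

In a full training loop, the value estimates passed to `gae_advantages` would come from the value model that regresses to the aggregated reward, and the resulting advantages would drive a standard advantage-based policy update; the sketch omits those pieces and only illustrates how multiple evaluator signals collapse into one RL-compatible reward.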