Multidimensional Rubric-oriented Reward Model Learning via Geometric Projection Reference Constraints

The integration of large language models (LLMs) into medical practice offers transformative potential, yet their real-world clinical applicability remains constrained by three critical alignment issues: (1) a mismatch between static evaluation benchmarks and the dynamic cognitive demands of clinical practice, (2) the difficulty of adapting to continuously evolving, multi-source medical standards, and (3) the limited capacity of conventional reward models to reflect nuanced, multi-dimensional medical quality criteria. To overcome these limitations, we introduce MR-RML (Multidimensional Rubric-oriented Reward Model Learning) with GPRC (Geometric Projection Reference Constraints), a novel alignment framework that structures medical standards into a multi-perspective matrix to guide both data generation and model optimization. Our approach introduces three key innovations: (1) a medical standard system that embeds domain-specific guidelines throughout the training pipeline; (2) an independent multi-dimensional reward model that decomposes evaluation criteria, moving from rule-based or LLM-based scoring to internalized reward modeling for better evaluation performance; and (3) geometric projection reference constraints that translate clinical cognitive logic into mathematical regularization, aligning scoring gradients with clinical reasoning and facilitating training with synthetically generated data. Extensive evaluations on the authoritative medical benchmark HealthBench demonstrate that our method significantly boosts the performance of the base Qwen-32B model, with improvements of 45% on the full subset and 85% on the hard subset. It achieves state-of-the-art results among open-source LLMs, scoring 62.7 (full) and 44.7 (hard), while also surpassing the majority of closed-source models.
