Medical question answering requires advanced reasoning that integrates domain knowledge with logical inference. However, existing large language models (LLMs) often generate reasoning chains that lack factual accuracy and clinical reliability. We propose Ranked Preference Reinforcement Optimization (RPRO), a novel framework that combines reinforcement learning with preference-driven reasoning refinement to enhance clinical chain-of-thought (CoT) performance. RPRO distinguishes itself from prior approaches by employing task-adaptive reasoning templates and a probabilistic evaluation mechanism that aligns model outputs with established clinical workflows, while automatically identifying and correcting low-quality reasoning chains. Unlike traditional pairwise preference methods, RPRO introduces groupwise ranking optimization based on the Bradley--Terry model and incorporates KL-divergence regularization for stable training. Experiments on PubMedQA, MedQA-USMLE, and a real-world clinical dataset from Far Eastern Memorial Hospital (FEMH) demonstrate consistent improvements over strong baselines. Remarkably, our 2B-parameter model outperforms much larger 7B--20B models, including medical-specialized variants. These findings show that combining preference optimization with quality-driven refinement offers a scalable and clinically grounded approach to building more reliable medical LLMs.
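As a point of reference, a minimal sketch of how a groupwise Bradley--Terry ranking objective with KL-divergence regularization is commonly instantiated is given below via the Plackett--Luce likelihood; the notation ($r_\theta$, $\pi_{\mathrm{ref}}$, $\beta$, $K$, $\sigma$) is illustrative and not taken from the source, so the exact RPRO objective may differ.
% Hedged sketch: a standard Plackett--Luce (groupwise Bradley--Terry) ranking
% likelihood with KL regularization; symbols are illustrative, not RPRO's exact loss.
\begin{align}
  P(\sigma \mid x) &= \prod_{k=1}^{K}
      \frac{\exp\!\big(r_\theta(x, y_{\sigma(k)})\big)}
           {\sum_{j=k}^{K} \exp\!\big(r_\theta(x, y_{\sigma(j)})\big)}, \\
  \mathcal{L}(\theta) &= -\,\mathbb{E}_{(x,\sigma)}\big[\log P(\sigma \mid x)\big]
      \;+\; \beta\,\mathrm{KL}\!\big(\pi_\theta(\cdot \mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big).
\end{align}
Here $\sigma$ ranks $K$ candidate reasoning chains for question $x$ from best to worst, $r_\theta$ scores each chain, and the KL term keeps the trained policy $\pi_\theta$ close to a reference policy $\pi_{\mathrm{ref}}$, the usual mechanism for stabilizing preference-based training.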