Reinforcement learning has powered many of the recent breakthroughs in large language models, especially for tasks where rewards can be computed automatically, such as code generation. However, these methods deteriorate in open-ended domains like medical consultation, where feedback is inherently ambiguous, highly context-dependent, and cannot be reduced to a reliable scalar signal. In such settings, RL must either rely on supervision-intensive reward models that often fail to generalize, or fall into pathological behaviors such as reward hacking, an especially troubling risk for high-stakes medical dialogue. To address these limitations, we introduce ORBIT, an open-ended rubric-based incremental training framework for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with dynamically constructed rubrics that serve as adaptive guides for incremental RL. Instead of relying on external medical knowledge bases or handcrafted rule sets, ORBIT uses rubric-driven feedback to steer the learning process. Its judge component can be instantiated with general-purpose instruction-following LLMs, removing the need for any task-specific fine-tuning. Applied to the Qwen3-4B-Instruct model, ORBIT raises the HealthBench-Hard score from 7.0 to 27.5 using only 2k training samples, achieving SOTA performance for models at this scale. With larger rubric datasets, ORBIT-trained models are competitive with the strongest open-source baselines on HealthBench-Hard. Our analysis shows that rubric-guided RL consistently improves consultation quality across diverse medical scenarios. We also apply the same rubric-generation and training pipeline to InfoBench, where ORBIT enhances instruction-following performance, highlighting the generality of rubric-based feedback.
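To make the rubric-driven feedback concrete, the sketch below shows one plausible way such a reward could be computed: an instruction-following LLM judge grades a candidate response against each rubric criterion, and the weighted fraction of satisfied criteria becomes the scalar RL reward. This is an illustrative assumption, not the authors' implementation; the names `RubricItem`, `build_judge_prompt`, `rubric_reward`, and the `judge_llm` callable are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RubricItem:
    criterion: str   # e.g. "Asks about symptom onset and duration"
    weight: float    # relative importance of this criterion

def build_judge_prompt(dialogue: str, response: str, item: RubricItem) -> str:
    # Prompt a general-purpose LLM judge to grade one rubric criterion.
    return (
        "You are grading a medical-consultation response.\n"
        f"Dialogue so far:\n{dialogue}\n\n"
        f"Candidate response:\n{response}\n\n"
        f"Criterion: {item.criterion}\n"
        "Answer with a single number: 1 if the criterion is satisfied, 0 otherwise."
    )

def rubric_reward(
    dialogue: str,
    response: str,
    rubric: List[RubricItem],
    judge_llm: Callable[[str], str],
) -> float:
    """Weighted fraction of rubric criteria the judge marks as satisfied."""
    total = sum(item.weight for item in rubric)
    earned = 0.0
    for item in rubric:
        verdict = judge_llm(build_judge_prompt(dialogue, response, item)).strip()
        if verdict.startswith("1"):
            earned += item.weight
    return earned / total if total > 0 else 0.0
```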