Recent Reinforcement Learning (RL) advances for Large Language Models (LLMs) have improved performance on reasoning tasks, yet their application to medical imaging under resource constraints remains underexplored. We introduce ChexReason, a vision-language model trained via an R1-style methodology (SFT followed by GRPO) using only 2,000 SFT samples, 1,000 RL samples, and a single A100 GPU. Evaluations on the CheXpert and NIH benchmarks reveal a fundamental tension: GRPO recovers in-distribution performance (23% improvement on CheXpert, macro-F1 = 0.346) but degrades cross-dataset transferability (19% drop on NIH). This mirrors the behavior of high-resource models such as NV-Reason-CXR-3B, suggesting the issue stems from the RL paradigm rather than from scale. We identify a generalization paradox in which the SFT checkpoint, before RL optimization, is the only one to improve on NIH, indicating that teacher-guided reasoning captures more institution-agnostic features. Furthermore, cross-model comparisons show that structured reasoning scaffolds benefit general-purpose VLMs but offer minimal gain for medically pre-trained models. Consequently, curated supervised fine-tuning may outperform aggressive RL for clinical deployments that require robustness across diverse patient populations.
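For context, a minimal sketch of the GRPO objective in its commonly cited form (following DeepSeekMath/DeepSeek-R1); the exact reward design and hyperparameters used for ChexReason are not stated in this abstract, so the group size $G$, clip range $\epsilon$, and KL weight $\beta$ below are illustrative assumptions:

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{q,\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid q)}
\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
\left( \min\!\big( r_{i,t}(\theta)\, \hat{A}_{i},\ \mathrm{clip}\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_{i} \big)
- \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right] \right) \right],
$$

where $r_{i,t}(\theta) = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})$ is the token-level importance ratio and $\hat{A}_{i} = \big(R_i - \mathrm{mean}(\{R_j\}_{j=1}^{G})\big) / \mathrm{std}(\{R_j\}_{j=1}^{G})$ is the group-normalized advantage computed from outcome rewards $R_i$, which removes the need for a learned value critic.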