In this work, we introduce RadImageNet-VQA, a large-scale dataset designed to advance radiologic visual question answering (VQA) on CT and MRI exams. Existing medical VQA datasets are limited in scale, dominated by X-ray imaging or biomedical illustrations, and often prone to text-based shortcuts. RadImageNet-VQA is built from expert-curated annotations and provides 750K images paired with 7.5M question-answer pairs. It covers three key tasks (abnormality detection, anatomy recognition, and pathology identification) spanning eight anatomical regions and 97 pathology categories, and supports open-ended, closed-ended, and multiple-choice questions. Extensive experiments show that state-of-the-art vision-language models still struggle with fine-grained pathology identification, particularly in open-ended settings and even after fine-tuning. Text-only analysis further reveals that model performance collapses to near-random without image inputs, confirming that RadImageNet-VQA is free from linguistic shortcuts. The full dataset and benchmark are publicly available at https://huggingface.co/datasets/raidium/RadImageNet-VQA.
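Since the dataset is hosted on the Hugging Face Hub, it can presumably be loaded with the standard `datasets` library. The sketch below is a minimal example assuming the repo id from the URL above; the available configs, split names, and field names (`image`, `question`, `answer`) are assumptions, not confirmed by the paper.

```python
from datasets import load_dataset

# Load the dataset from the Hugging Face Hub.
# Repo id taken from the URL in the abstract; the split name
# "test" and the field names below are assumptions.
ds = load_dataset("raidium/RadImageNet-VQA", split="test")

# Inspect one sample; expected (assumed) fields are an image
# plus a question-answer pair, e.g. sample["image"],
# sample["question"], sample["answer"].
sample = ds[0]
print(sample)
```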