Medical visual question answering (VQA) aims to answer clinically relevant questions about input medical images. This technique has the potential to improve the efficiency of medical professionals while relieving the burden on the public health system, particularly in resource-poor countries. Existing medical VQA methods tend to encode medical images and learn the correspondence between visual features and questions without exploiting the spatial, semantic, or medical knowledge behind them. This is partially because current medical VQA datasets are small and often contain only simple questions. Therefore, we first collected a comprehensive and large-scale medical VQA dataset focusing on chest X-ray images. The questions in our dataset involve detailed relational attributes, such as disease names, locations, severity levels, and types. Based on this dataset, we also propose a novel baseline method that constructs three different relationship graphs over the image regions, questions, and semantic labels: a spatial relationship graph, a semantic relationship graph, and an implicit relationship graph. Answers and graph reasoning paths are then learned for different questions.
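As a minimal illustration (not the authors' implementation), the sketch below shows one common way a spatial relationship graph over detected image regions could be instantiated: regions become graph nodes, edges are added when bounding boxes overlap, and a single round of neighbourhood aggregation produces relation-aware region features. The box coordinates, feature dimensions, and IoU threshold are illustrative assumptions.

```python
import numpy as np

def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def spatial_adjacency(boxes, iou_thresh=0.1):
    """Connect two regions when their bounding boxes overlap above a threshold."""
    n = len(boxes)
    adj = np.eye(n)  # self-loops keep each node's own feature
    for i in range(n):
        for j in range(i + 1, n):
            if box_iou(boxes[i], boxes[j]) > iou_thresh:
                adj[i, j] = adj[j, i] = 1.0
    return adj

def message_passing(features, adj):
    """One step of mean aggregation over graph neighbours (GCN-style)."""
    deg = adj.sum(axis=1, keepdims=True)
    return (adj @ features) / deg

# Toy example: 4 region features of dimension 8 with hand-picked boxes (all hypothetical).
rng = np.random.default_rng(0)
region_feats = rng.standard_normal((4, 8))
region_boxes = [(0, 0, 50, 50), (40, 40, 90, 90), (200, 200, 260, 260), (45, 0, 95, 60)]

adj = spatial_adjacency(region_boxes)
updated = message_passing(region_feats, adj)
print(adj)            # which regions exchange information
print(updated.shape)  # (4, 8): relation-aware region features
```

The semantic and implicit graphs mentioned in the abstract would follow the same pattern with different adjacency definitions (e.g., label co-occurrence or learned affinities); this sketch only covers the spatial case under the stated assumptions.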