Medical image visual question answering (VQA) is the task of answering clinical questions given a radiographic image, a challenging problem that requires a model to integrate both vision and language information. To solve medical VQA problems with limited training data, the pretrain-finetune paradigm is widely used to improve model generalization. In this paper, we propose a self-supervised method that applies masked image modeling, masked language modeling, image-text matching, and image-text alignment via contrastive learning (M2I2) for pretraining on a medical image-caption dataset, and then fine-tunes on downstream medical VQA tasks. The proposed method achieves state-of-the-art performance on all three public medical VQA datasets. Our code and models are available at https://github.com/pengfeiliHEU/M2I2.
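To make the combination of the four pretraining objectives concrete, here is a minimal sketch of how such a joint loss could be assembled, assuming a PyTorch-style encoder. The tensor names, shapes, and the equal weighting of the four terms are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def m2i2_style_pretraining_loss(img_recon, img_target, mlm_logits, mlm_labels,
                                itm_logits, itm_labels, img_emb, txt_emb,
                                temperature=0.07):
    """Hypothetical combination of the four self-supervised objectives
    named in the abstract. All inputs are placeholder outputs of some
    vision-language encoder:
      img_recon, img_target: (B, N, D) masked-patch predictions and targets
      mlm_logits: (B, L, V) token logits; mlm_labels: (B, L), -100 at
                  unmasked positions
      itm_logits: (B, 2) matched/mismatched scores; itm_labels: (B,)
      img_emb, txt_emb: (B, E) projected global embeddings
    """
    # Masked image modeling: regress the masked image patches.
    loss_mim = F.mse_loss(img_recon, img_target)

    # Masked language modeling: cross-entropy over masked tokens only.
    loss_mlm = F.cross_entropy(mlm_logits.flatten(0, 1), mlm_labels.flatten(),
                               ignore_index=-100)

    # Image-text matching: binary classification of paired vs. mismatched pairs.
    loss_itm = F.cross_entropy(itm_logits, itm_labels)

    # Image-text alignment via contrastive learning (symmetric InfoNCE).
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_itc = (F.cross_entropy(sim, targets) +
                F.cross_entropy(sim.t(), targets)) / 2

    # Equal weighting here is an assumption; real systems often tune
    # per-objective coefficients.
    return loss_mim + loss_mlm + loss_itm + loss_itc
```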