Images in the medical domain are fundamentally different from general-domain images. Consequently, it is infeasible to directly employ general-domain Visual Question Answering (VQA) models in the medical domain. Additionally, medical image annotation is a costly and time-consuming process. To overcome these limitations, we propose a solution inspired by self-supervised pretraining of Transformer-style architectures for NLP, Vision, and Language tasks. Our method learns richer medical image and text semantic representations using Masked Language Modeling (MLM) with image features as the pretext task on a large medical image+caption dataset. The proposed solution achieves new state-of-the-art performance on two VQA datasets for radiology images -- VQA-Med 2019 and VQA-RAD -- outperforming even the ensemble models of previous best solutions. Moreover, our solution provides attention maps that aid model interpretability. The code is available at https://github.com/VirajBagal/MMBERT
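To make the pretext task concrete, the following is a minimal sketch (not the authors' released code) of masked language modeling conditioned on image features: projected visual features and caption token embeddings are concatenated, encoded jointly, and the model is trained to recover the masked caption tokens. The vocabulary size, mask token id, number of image tokens, and feature dimensions below are illustrative assumptions.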
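```python
# Sketch of MLM with image features as the pretext task.
# All hyperparameters here are assumed, not taken from the paper.
import torch
import torch.nn as nn

VOCAB_SIZE, HIDDEN, MASK_ID = 30522, 768, 103  # BERT-style values (assumed)

class MultimodalMLM(nn.Module):
    def __init__(self, image_feat_dim=2048):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB_SIZE, HIDDEN)
        # Project CNN image features into the same space as token embeddings.
        self.image_proj = nn.Linear(image_feat_dim, HIDDEN)
        layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.mlm_head = nn.Linear(HIDDEN, VOCAB_SIZE)

    def forward(self, image_feats, token_ids):
        # image_feats: (B, num_image_tokens, image_feat_dim), e.g. pooled CNN maps
        # token_ids:   (B, seq_len) caption tokens, some replaced by MASK_ID
        img = self.image_proj(image_feats)
        txt = self.token_emb(token_ids)
        hidden = self.encoder(torch.cat([img, txt], dim=1))
        # Predict vocabulary logits only for the text positions.
        return self.mlm_head(hidden[:, img.size(1):, :])

# One illustrative training step with random tensors standing in for real data.
model = MultimodalMLM()
image_feats = torch.randn(2, 5, 2048)
token_ids = torch.randint(0, VOCAB_SIZE, (2, 20))
labels = token_ids.clone()
mask = torch.rand(token_ids.shape) < 0.15   # mask ~15% of caption tokens
token_ids[mask] = MASK_ID
labels[~mask] = -100                        # loss computed on masked positions only

logits = model(image_feats, token_ids)
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), labels.reshape(-1), ignore_index=-100
)
loss.backward()
```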
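After pretraining on image+caption pairs with this objective, the encoder can be fine-tuned on the downstream VQA datasets, with the question taking the place of the caption and an answer classifier replacing the MLM head.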