Medical Visual Question Answering (VQA) is a challenging multi-modal task that has attracted wide attention from the computer vision and natural language processing research communities. Since most current medical VQA models focus on visual content while ignoring the importance of text, this paper proposes a multi-view attention-based model (MuVAM) for medical visual question answering that integrates the high-level semantics of medical images on the basis of the text description. Firstly, different methods are used to extract the features of the image and the question for the two modalities of vision and text. Secondly, this paper proposes a multi-view attention mechanism that includes Image-to-Question (I2Q) attention and Word-to-Text (W2T) attention. Multi-view attention correlates the question with both the image and individual words so as to analyze the question more thoroughly and obtain an accurate answer. Thirdly, a composite loss is presented to predict the answer accurately after multi-modal feature fusion and to improve the similarity between visual and textual cross-modal features; it consists of a classification loss and an image-question complementary (IQC) loss. Finally, to address data errors and missing labels in the VQA-RAD dataset, we collaborate with medical experts to correct and complete the dataset, constructing an enhanced version, VQA-RADPh. Experiments on these two datasets show that MuVAM surpasses state-of-the-art methods.
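To make the composite loss concrete, the following minimal PyTorch sketch combines a classification loss with one plausible form of the image-question complementary (IQC) term; the function name, the cosine-similarity formulation, and the weighting factor `lam` are illustrative assumptions rather than the paper's exact definition.

```python
# Hypothetical sketch of a composite loss: classification loss plus an IQC term
# that encourages the pooled visual and textual features to be similar.
# The cosine-similarity form of the IQC term is an assumption for illustration.
import torch
import torch.nn.functional as F

def composite_loss(logits, answer_ids, image_feat, question_feat, lam=0.5):
    # Standard classification loss over the candidate-answer set.
    cls_loss = F.cross_entropy(logits, answer_ids)
    # Assumed IQC term: 1 - mean cosine similarity between image and question
    # representations, so higher cross-modal agreement lowers the loss.
    iqc_loss = 1.0 - F.cosine_similarity(image_feat, question_feat, dim=-1).mean()
    return cls_loss + lam * iqc_loss

# Usage with dummy tensors (batch of 8, hypothetical answer-set size of 100):
logits = torch.randn(8, 100)             # predicted answer scores
answers = torch.randint(0, 100, (8,))    # gold answer indices
img = torch.randn(8, 1024)               # pooled visual features
qst = torch.randn(8, 1024)               # pooled question features
loss = composite_loss(logits, answers, img, qst)
```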