A key problem in the medical visual question answering task is how to effectively fuse language and medical image features with limited datasets. To better exploit the multi-scale information in medical images, previous methods directly embed the multi-stage visual feature maps as tokens of the same size and fuse them with the text representation. However, this confuses visual features from different stages. To this end, we propose a simple but powerful multi-stage feature fusion method, MF2-MVQA, which fuses multi-level visual features with textual semantics stage by stage. MF2-MVQA achieves state-of-the-art performance on the VQA-Med 2019 and VQA-RAD datasets. Visualization results further verify that our model outperforms previous work.
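The stage-wise fusion idea can be illustrated with a minimal sketch: instead of concatenating tokens from all visual stages into one sequence, each stage's features are fused into the text representation one stage at a time. All shapes, the single-head cross-attention fusion operator, and the stage sizes below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attend(text, visual):
    """Fuse one visual stage into the text representation via
    single-head scaled dot-product cross-attention (hypothetical
    fusion operator, stands in for the paper's fusion module)."""
    d = text.shape[-1]
    scores = text @ visual.T / np.sqrt(d)           # (T, V) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over visual tokens
    return text + weights @ visual                  # residual update of text

d_model = 32
text = rng.standard_normal((8, d_model))            # 8 text tokens

# Three visual stages with different token counts (coarse-to-fine feature
# maps, e.g. 7x7, 14x14, 28x28 grids), each already projected to d_model.
# Fusing them one stage at a time keeps features from different stages
# separate, rather than mixing them into a single same-size token sequence.
stages = [rng.standard_normal((n, d_model)) for n in (49, 196, 784)]
for visual in stages:
    text = cross_attend(text, visual)

print(text.shape)                                   # fused representation
```

The fused text representation keeps its original shape after each stage, so the same fusion operator can be applied repeatedly across however many backbone stages are used.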