We present a novel multimodal interpretable VQA model that answers questions more accurately and generates diverse explanations. Although researchers have proposed several methods that generate human-readable, fine-grained natural language sentences to explain a model's decision, these methods have focused solely on the information in the image. Ideally, a model should draw on information both inside and outside the image to generate correct explanations, just as we draw on background knowledge in daily life. The proposed method incorporates outside knowledge and multiple image captions to increase the diversity of information available to the model. The contribution of this paper is an interpretable visual question answering model that uses multimodal inputs to improve the rationality of the generated results. Experimental results show that our model outperforms state-of-the-art methods in terms of answer accuracy and explanation rationality.
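To make the multimodal setup concrete, the sketch below shows one way a model could fuse image features, several caption embeddings, retrieved outside-knowledge embeddings, and the question before predicting an answer and producing explanation logits. This is a minimal illustration only: the module names, dimensions, mean-pooling, and concatenation-based fusion are assumptions for exposition, not the architecture described in this paper.

```python
# Minimal sketch (assumed, not the authors' implementation): fuse image
# features, multiple caption embeddings, and outside-knowledge embeddings
# with the question, then predict an answer and explanation-token logits.
import torch
import torch.nn as nn


class MultimodalFusionVQA(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512,
                 num_answers=3000, vocab_size=30522):
        super().__init__()
        # Project each information source into a shared hidden space.
        self.img_proj = nn.Linear(img_dim, hidden)
        self.cap_proj = nn.Linear(txt_dim, hidden)   # multiple image captions
        self.knw_proj = nn.Linear(txt_dim, hidden)   # outside-knowledge snippets
        self.q_proj = nn.Linear(txt_dim, hidden)     # question encoding
        # Fuse the four sources and predict an answer over a fixed vocabulary.
        self.fuse = nn.Sequential(nn.Linear(4 * hidden, hidden), nn.ReLU())
        self.answer_head = nn.Linear(hidden, num_answers)
        # A single linear layer stands in for the explanation generator.
        self.explain_head = nn.Linear(hidden, vocab_size)

    def forward(self, img_feat, cap_feats, knw_feats, q_feat):
        # cap_feats / knw_feats: (batch, n_items, txt_dim); mean-pool over items
        # so the model can draw on several captions and knowledge snippets.
        img = self.img_proj(img_feat)
        cap = self.cap_proj(cap_feats).mean(dim=1)
        knw = self.knw_proj(knw_feats).mean(dim=1)
        q = self.q_proj(q_feat)
        fused = self.fuse(torch.cat([img, cap, knw, q], dim=-1))
        return self.answer_head(fused), self.explain_head(fused)


if __name__ == "__main__":
    model = MultimodalFusionVQA()
    answers, explanation_logits = model(
        torch.randn(2, 2048),       # image features
        torch.randn(2, 5, 768),     # five caption embeddings per image
        torch.randn(2, 10, 768),    # ten retrieved knowledge embeddings
        torch.randn(2, 768),        # question embedding
    )
    print(answers.shape, explanation_logits.shape)  # (2, 3000), (2, 30522)
```

In practice an attention-based fusion and an autoregressive decoder would replace the mean-pooling and the single linear explanation head; the point of the sketch is only that the answer and the explanation are conditioned on image, captions, and outside knowledge jointly.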