Tell-and-Answer: Towards Explainable Visual Question Answering using Attributes and Captions

Visual Question Answering (VQA) has attracted attention from both the computer vision and natural language processing communities. Most existing approaches follow the same pipeline: represent the image with a pre-trained CNN, then combine the uninterpretable CNN features with the question to predict the answer. Although such end-to-end models may report promising performance, they rarely provide any insight into the VQA process beyond the answer itself. In this work, we propose to break end-to-end VQA into two steps, explaining and reasoning, and to make VQA more explainable by exposing the intermediate results that pass between them. To that end, we first extract attributes and generate descriptions as explanations for an image, using pre-trained attribute detectors and image captioning models, respectively. Next, a reasoning module uses these explanations in place of the image to infer an answer to the question. This breakdown has two advantages: (1) the attributes and captions reflect what the system extracts from the image, and can therefore provide some explanation for the predicted answer; (2) when the predicted answer is wrong, these intermediate results help identify whether the failure lies in the image understanding part or in the answer inference part. We conduct extensive experiments on a popular VQA dataset and dissect all results according to several measurements of explanation quality. Our system achieves performance comparable to the state of the art, with the added benefits of explainability and an inherent ability to improve further given higher-quality explanations.
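
The two-step breakdown described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `Explanation`, `explain`, `answer`, and the three `toy_*` functions are hypothetical, and the toy functions return hard-coded or rule-based results in place of trained attribute detectors, captioning models, and reasoning modules. The sketch only shows the data flow, in which the reasoning step sees the question and the textual explanation but never the image.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Explanation:
    """Human-readable intermediate result passed from step 1 to step 2."""
    attributes: List[str]  # words predicted by an attribute detector
    caption: str           # sentence produced by a captioning model


def explain(image: Any,
            detect_attributes: Callable[[Any], List[str]],
            generate_caption: Callable[[Any], str]) -> Explanation:
    """Step 1 (explaining): describe the image with attributes and a caption."""
    return Explanation(detect_attributes(image), generate_caption(image))


def answer(question: str,
           explanation: Explanation,
           reason: Callable[[str, List[str], str], str]) -> str:
    """Step 2 (reasoning): infer the answer without access to the image."""
    return reason(question, explanation.attributes, explanation.caption)


# Toy stand-ins (assumptions for illustration only, not trained models).
def toy_detect_attributes(image: Any) -> List[str]:
    return ["dog", "frisbee", "grass"]


def toy_generate_caption(image: Any) -> str:
    return "a dog catching a frisbee on the grass"


def toy_reason(question: str, attributes: List[str], caption: str) -> str:
    # Hypothetical rule: pick the first attribute that the caption mentions
    # and that the question does not already contain.
    question_words = set(question.lower().strip("?").split())
    caption_words = set(caption.lower().split())
    for attribute in attributes:
        if attribute in caption_words and attribute not in question_words:
            return attribute
    return "unknown"


# The image is a placeholder here because the toy models ignore it.
explanation = explain(None, toy_detect_attributes, toy_generate_caption)
prediction = answer("What is the dog catching?", explanation, toy_reason)

# Printing the explanation next to the answer is what makes errors traceable.
print(explanation)
print(prediction)
```

Because the explanation is inspected alongside the prediction, a wrong answer can be attributed either to a poor explanation (step 1) or to faulty inference over a good explanation (step 2).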
