Most existing work in visual question answering (VQA) is dedicated to improving the accuracy of predicted answers while disregarding explanations. We argue that the explanation for an answer is as important as, or even more important than, the answer itself, since it makes the question-answering process more understandable and traceable. To this end, we propose a new task, VQA-E (VQA with Explanation), in which computational models are required to generate an explanation along with the predicted answer. We first construct a new dataset and then frame the VQA-E problem in a multi-task learning architecture. Our VQA-E dataset is automatically derived from the VQA v2 dataset by intelligently exploiting the available captions. We conduct a user study to validate the quality of the explanations synthesized by our method. We quantitatively show that the additional supervision from explanations not only produces insightful textual sentences that justify the answers, but also improves the performance of answer prediction. Our model outperforms state-of-the-art methods by a clear margin on the VQA v2 dataset.