The domain of joint vision-language understanding, particularly reasoning in Visual Question Answering (VQA) models, has attracted significant attention in recent years. While most existing VQA models focus on improving answer accuracy, the way a model arrives at its answer often remains a black box. As a step towards making the VQA task more explainable and interpretable, we build upon a state-of-the-art (SOTA) VQA framework by augmenting it with an end-to-end explanation generation module. In this paper, we investigate two network architectures, a Long Short-Term Memory (LSTM) and a Transformer decoder, as the explanation generator. Our method generates human-readable textual explanations while maintaining SOTA VQA accuracy on the GQA-REX (77.49%) and VQA-E (71.48%) datasets. Human evaluators judge 65.16% of the generated explanations to be valid, and 60.5% of the generated explanations are both valid and lead to the correct answer.
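To make the described setup concrete, below is a minimal sketch, assuming a PyTorch implementation, of how a Transformer-decoder explanation generator could be attached to fused multimodal features from a VQA backbone. The class name ExplanationDecoder, the feature tensor fused_feats, and all hyperparameters are illustrative assumptions, not the paper's actual code.

```python
# Hypothetical sketch: a Transformer-decoder explanation head conditioned on
# fused vision-language features (names and sizes are illustrative, not the paper's).
import torch
import torch.nn as nn

class ExplanationDecoder(nn.Module):
    def __init__(self, vocab_size: int, d_model: int = 512, n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor, fused_feats: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq) explanation token ids generated so far.
        # fused_feats: (batch, mem_len, d_model) multimodal features from the VQA
        # backbone, used as cross-attention memory for the decoder.
        seq_len = tokens.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        h = self.decoder(self.embed(tokens), fused_feats, tgt_mask=causal_mask)
        return self.out(h)  # per-token vocabulary logits for the explanation

# Usage example: logits for 2 explanation prefixes of length 5,
# conditioned on 36 region-level fused features each.
model = ExplanationDecoder(vocab_size=10000)
logits = model(torch.randint(0, 10000, (2, 5)), torch.randn(2, 36, 512))
print(logits.shape)  # torch.Size([2, 5, 10000])
```

The LSTM variant mentioned in the abstract would replace the Transformer decoder with a recurrent decoder over the same fused features; training couples the explanation loss with the VQA answer loss so the whole pipeline remains end-to-end.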