Providing explanations for visual question answering (VQA) has gained much attention in research. However, most existing systems use separate models for predicting answers and providing explanations. We argue that training explanation models independently of the QA model makes the explanations less grounded and limits performance. To address this, we propose a multitask learning approach towards a Unified Model for more grounded and consistent generation of both Answers and Explanations (UMAE). Specifically, we add artificial prompt tokens to training instances and finetune a multimodal encoder-decoder model on a variety of VQA tasks. In our experiments, UMAE models surpass the prior SOTA answer accuracy on A-OKVQA by 10-15%, show competitive results on OK-VQA, achieve new SOTA explanation scores on A-OKVQA and VCR, and demonstrate promising out-of-domain performance on VQA-X.
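To make the prompt-token idea concrete, the sketch below (not the authors' released code) shows one way to prepend an artificial task token to each training instance so that a single encoder-decoder is finetuned jointly on answer and explanation generation. The model/tokenizer names, the task-token strings, and the helper `build_instance` are illustrative assumptions; the multimodal image inputs are omitted for brevity.

```python
# Minimal sketch of multitask finetuning with artificial prompt tokens.
# Assumptions: a text-only seq2seq model stands in for the multimodal
# encoder-decoder; task tokens and function names are hypothetical.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "t5-base"  # stand-in for the multimodal encoder-decoder
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# One artificial prompt token per task in the multitask mixture.
TASK_TOKENS = ["<answer>", "<explain>", "<answer_and_explain>"]
tokenizer.add_special_tokens({"additional_special_tokens": TASK_TOKENS})
model.resize_token_embeddings(len(tokenizer))

def build_instance(task_token: str, question: str, target: str):
    """Prepend the artificial task token to the question; the target is an
    answer, an explanation, or both, depending on the task."""
    source = f"{task_token} question: {question}"
    enc = tokenizer(source, return_tensors="pt", truncation=True)
    labels = tokenizer(target, return_tensors="pt", truncation=True).input_ids
    return {**enc, "labels": labels}

# The same question can appear under different task tokens during training,
# so answers and explanations are generated by one shared model.
batch = build_instance("<answer>", "What is the man holding?", "a surfboard")
loss = model(**batch).loss  # standard seq2seq cross-entropy
```

Because the task token is the only part of the input that changes across tasks, the shared encoder-decoder can reuse the same parameters for answering and explaining, which is the grounding effect the abstract argues for.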