The field of visual question answering (VQA) has recently seen a surge in research focused on providing explanations for predicted answers. However, current systems mostly rely on separate models to predict answers and generate explanations, leading to less grounded and frequently inconsistent results. To address this, we propose a multitask learning approach towards a Unified Model for Answer and Explanation generation (UMAE). Our approach involves the addition of artificial prompt tokens to training data and fine-tuning a multimodal encoder-decoder model on a variety of VQA-related tasks. In our experiments, UMAE models surpass the prior state-of-the-art answer accuracy on A-OKVQA by 10~15%, show competitive results on OK-VQA, achieve new state-of-the-art explanation scores on A-OKVQA and VCR, and demonstrate promising out-of-domain performance on VQA-X.
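Below is a minimal sketch of the multitask prompt-token setup the abstract describes, assuming a generic sequence-to-sequence training pipeline. The prompt strings, task names, and dataclass are illustrative assumptions for exposition, not the exact tokens or code used in the UMAE paper.

```python
from dataclasses import dataclass

# Hypothetical artificial prompt tokens, one per task/dataset
# (the actual token vocabulary is defined by the UMAE training setup).
TASK_PROMPTS = {
    "aokvqa_answer": "<A-OKVQA-ANSWER>",
    "aokvqa_explain": "<A-OKVQA-EXPLAIN>",
    "okvqa_answer": "<OK-VQA-ANSWER>",
    "vcr_explain": "<VCR-EXPLAIN>",
}

@dataclass
class Example:
    task: str       # key into TASK_PROMPTS
    question: str   # visual question (image features go through the multimodal encoder)
    target: str     # gold answer or gold explanation, depending on the task

def build_input(example: Example) -> str:
    """Prepend the task-specific artificial prompt token to the question text."""
    return f"{TASK_PROMPTS[example.task]} {example.question}"

# Mixing examples from several tasks yields a single unified training stream,
# so one encoder-decoder model learns to answer and explain jointly.
examples = [
    Example("aokvqa_answer", "What sport is the man playing?", "tennis"),
    Example("aokvqa_explain", "What sport is the man playing?",
            "He is holding a racket on a court with a net."),
]
batch_inputs = [build_input(ex) for ex in examples]
batch_targets = [ex.target for ex in examples]
```

At inference time, the same prompt tokens steer the fine-tuned model toward either answer or explanation generation without any architectural change.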