We tackle the challenge of Visual Question Answering in a multi-image setting for the ISVQA dataset. Traditional VQA tasks focus on a single-image setting, where the target answer is generated from a single image. Image-set VQA, however, operates on a set of images and requires finding connections between images, relating objects across images based on these connections, and generating a unified answer. In this report, we explore four approaches aimed at improving performance on this task. We analyse and compare our results against three baseline models - LXMERT, HME-VideoQA and VisualBERT - and show that our approaches provide a slight improvement over the baselines. Specifically, we try to improve the spatial awareness of the model and help it identify color using enhanced pre-training, reduce language dependence using adversarial regularization, and improve counting using a regression loss and graph-based deduplication. We further present an in-depth analysis of the language bias in the ISVQA dataset and show how models trained on ISVQA implicitly learn to associate language more strongly with the final answer.