Visual question answering on document images that contain textual, visual, and layout information, called document VQA, has received much attention recently. Although many datasets have been proposed for developing document VQA systems, most existing datasets focus on understanding content relationships within a single image rather than across multiple images. In this study, we propose a new multi-image document VQA dataset, SlideVQA, containing 2.6k+ slide decks composed of 52k+ slide images and 14.5k questions about slide decks. SlideVQA requires complex reasoning, including single-hop, multi-hop, and numerical reasoning, and also provides annotated arithmetic expressions for numerical answers to enhance numerical reasoning ability. Moreover, we developed a new end-to-end document VQA model that treats evidence selection and question answering in a unified sequence-to-sequence format. Experiments on SlideVQA show that our model outperforms existing state-of-the-art QA models but still falls far short of human performance. We believe that our dataset will facilitate research on document VQA.