Document-based Visual Question Answering examines the document understanding of document images conditioned on natural language questions. We propose a new document-based VQA dataset, PDF-VQA, to comprehensively examine document understanding from various aspects, including document element recognition, document layout structural understanding, contextual understanding, and key information extraction. Our PDF-VQA dataset extends the current scope of document understanding, which is limited to a single document page, to a new setting that asks questions over full documents of multiple pages. We also propose a new graph-based VQA model that explicitly integrates the spatial and hierarchical structural relationships between different document elements to boost document structural understanding. Performance is compared against several baselines over different question types and tasks.\footnote{The full dataset will be released after paper acceptance.}