Document-based Visual Question Answering examines the understanding of document images conditioned on natural language questions. We propose a new document-based VQA dataset, PDF-VQA, to comprehensively examine document understanding from multiple aspects, including document element recognition, document layout structural understanding, contextual understanding, and key information extraction. Our PDF-VQA dataset extends the current scope of document understanding, which is limited to a single document page, to a new setting that asks questions over full documents spanning multiple pages. We also propose a new graph-based VQA model that explicitly integrates the spatial and hierarchical structural relationships between different document elements to boost document structural understanding. Performance is compared with several baselines over different question types and tasks\footnote{The full dataset will be released after paper acceptance.}.