Document Visual Question Answering (DocVQA) refers to the task of answering questions posed over document images. Existing work on DocVQA considers only single-page documents. However, in real-world scenarios documents are mostly composed of multiple pages that must be processed as a whole. In this work we extend DocVQA to the multi-page scenario. To this end, we first create a new dataset, MP-DocVQA, where questions are posed over multi-page documents instead of single pages. Second, we propose a new hierarchical method, Hi-VT5, based on the T5 architecture, that overcomes the limitations of current methods when processing long multi-page documents. The proposed method is based on a hierarchical transformer architecture where the encoder summarizes the most relevant information of every page, and the decoder then takes this summarized information to generate the final answer. Through extensive experimentation, we demonstrate that our method is able, in a single stage, to answer the questions and to identify the page that contains the information needed to find the answer, which can serve as a kind of explainability measure.
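The two-stage flow described above can be sketched as follows. This is a minimal illustration of the hierarchical encode-then-decode pattern, not the authors' implementation: all function names and the fixed-size summary scheme are hypothetical stand-ins. The key idea it shows is that each page is encoded independently into a small, fixed number of summary representations, so the decoder's input grows with the number of pages times the summary size rather than with the total token count of the document.

```python
from typing import List

def encode_page(page_tokens: List[str], num_summary: int = 2) -> List[str]:
    """Stand-in for a per-page transformer encoder (hypothetical).
    Here we simply keep the first num_summary tokens as the page's
    "summary"; a real model would learn which information to compress."""
    return page_tokens[:num_summary]

def decode_answer(summaries: List[List[str]]) -> str:
    """Stand-in for the decoder (hypothetical). It only ever attends to
    num_pages * num_summary summary tokens, never the full document."""
    flat = [tok for page in summaries for tok in page]
    return " ".join(flat)

# Toy two-page "document": each page is a list of tokens.
pages = [["invoice", "total", "42", "usd"],
         ["terms", "net", "30", "days"]]

# Stage 1: encode every page independently into fixed-size summaries.
summaries = [encode_page(p) for p in pages]

# Stage 2: the decoder consumes only the concatenated summaries.
answer_context = decode_answer(summaries)
```

Because each page is encoded in isolation, adding pages leaves the per-page encoder cost unchanged and enlarges the decoder input only by `num_summary` tokens per page, which is what makes the approach tractable for long multi-page documents.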