Document Visual Question Answering (VQA) aims to understand visually-rich documents in order to answer questions in natural language, an emerging research topic for both Natural Language Processing and Computer Vision. In this work, we introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages comprising semi-structured table(s) and unstructured text, together with 16,558 question-answer pairs, built by extending the TAT-QA dataset. These documents are sampled from real-world financial reports and contain many numbers, so discrete reasoning capability is required to answer questions on this dataset. Based on TAT-DQA, we further develop a novel model named MHST that takes into account information from multiple modalities, including text, layout and visual image, to intelligently address different types of questions with corresponding strategies, i.e., extraction or reasoning. Extensive experiments show that the MHST model significantly outperforms the baseline methods, demonstrating its effectiveness. However, its performance still lags far behind that of expert humans. We expect that our new TAT-DQA dataset will facilitate research on deep understanding of visually-rich documents combining vision and language, especially for scenarios that require discrete reasoning. We also hope the proposed model will inspire researchers to design more advanced Document VQA models in the future. Our dataset will be publicly available for non-commercial use at https://nextplusplus.github.io/TAT-DQA/.