Recent studies on machine reading comprehension have focused on text-level understanding but have not yet reached the level at which humans understand the visual layout and content of real-world documents. In this study, we introduce a new visual machine reading comprehension dataset, named VisualMRC, in which, given a question and a document image, a machine reads and comprehends the text in the image to answer the question in natural language. Compared with existing visual question answering (VQA) datasets that contain text in images, VisualMRC focuses more on developing natural language understanding and generation abilities. It contains 30,000+ pairs of a question and an abstractive answer for 10,000+ document images sourced from webpages in multiple domains. We also introduce a new model that extends existing sequence-to-sequence models, pre-trained on large-scale text corpora, to take the visual layout and content of documents into account. Experiments on VisualMRC show that this model outperformed the base sequence-to-sequence models and a state-of-the-art VQA model; however, its performance is still below that of humans on most automatic evaluation metrics. The dataset will facilitate research aimed at connecting vision and language understanding.