Infographics are documents designed to communicate information effectively using a combination of textual, graphical, and visual elements. In this work, we explore the automatic understanding of infographic images using the Visual Question Answering (VQA) technique. To this end, we present InfographicVQA, a new dataset that comprises a diverse collection of infographics along with natural language question and answer annotations. The collected questions require methods to jointly reason over the document layout, textual content, graphical elements, and data visualizations. We curate the dataset with emphasis on questions that require elementary reasoning and basic arithmetic skills. Finally, we evaluate two strong baselines based on state-of-the-art multi-modal VQA models and establish baseline performance for the new task. The dataset, code, and leaderboard will be made available at http://docvqa.org