Current visual question answering (VQA) tasks mainly consider answering human-annotated questions about natural images. However, abstract diagrams, despite their semantic richness, remain understudied in visual understanding and reasoning research. In this work, we introduce a new challenge of Icon Question Answering (IconQA), with the goal of answering a question in an icon image context. We release IconQA, a large-scale dataset that consists of 107,439 questions across three sub-tasks: multi-image-choice, multi-text-choice, and filling-in-the-blank. The IconQA dataset is inspired by real-world diagram word problems that highlight the importance of abstract diagram understanding and comprehensive cognitive reasoning. Thus, IconQA requires not only perception skills such as object recognition and text understanding, but also diverse cognitive reasoning skills, such as geometric reasoning, commonsense reasoning, and arithmetic reasoning. To help potential IconQA models learn semantic representations of icon images, we further release an icon dataset, Icon645, which contains 645,687 colored icons across 377 classes. We conduct extensive user studies and blind experiments, and we reproduce a wide range of advanced VQA methods to benchmark the IconQA task. We also develop a strong IconQA baseline, Patch-TRM, which applies a pyramid cross-modal Transformer with input diagram embeddings pre-trained on the icon dataset. IconQA and Icon645 are available at https://iconqa.github.io.