The ideal form of Visual Question Answering requires understanding, grounding, and reasoning in the joint space of vision and language, and serves as a proxy for the broader AI task of scene understanding. However, most existing VQA benchmarks restrict the model to picking an answer from a pre-defined set of options and pay little attention to text. We present a new challenge with a dataset containing 23,781 questions based on 10,124 image-text pairs. Specifically, the task requires a model to align multimedia representations of the same entity, perform multi-hop reasoning between image and text, and finally answer the question in natural language. The aim of this challenge is to develop and benchmark models capable of multimedia entity alignment, multi-step reasoning, and open-ended answer generation.
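To make the task format concrete, the sketch below shows what a single example in such a dataset might look like. This is a minimal, purely illustrative assumption: the field names, sample content, and the placeholder pipeline are hypothetical and do not reflect the dataset's actual schema or the authors' method.

```python
# Hypothetical example record for an image-text VQA task with open-ended answers.
# All field names and values are illustrative assumptions, not the real dataset schema.
example = {
    "image_path": "images/000123.jpg",   # visual side of the image-text pair
    "passage": "A short text passage mentioning the same entities as the image.",
    "question": "An open-ended question requiring both the image and the passage.",
    "answer": "A free-form natural-language answer.",
}

def answer_question(example: dict) -> str:
    """Placeholder for the intended pipeline: ground entities in the image,
    align them with their mentions in the passage (multimedia entity alignment),
    reason across both modalities (multi-hop reasoning), and generate the answer."""
    raise NotImplementedError("Model-specific implementation goes here.")
```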