Recently, there has been increasing interest in building question answering (QA) models that reason across multiple modalities, such as text and images. However, QA over images is often limited to picking the answer from a pre-defined set of options. Moreover, images in the real world, especially in news, contain objects that are co-referential with the text, with complementary information coming from both modalities. In this paper, we present a new QA evaluation benchmark with 1,384 questions over news articles that require cross-media grounding of objects in images onto text. Specifically, the task involves multi-hop questions that require reasoning over image-caption pairs to identify the grounded visual object being referred to, and then predicting a span from the news body text to answer the question. In addition, we introduce a novel multimedia data augmentation framework, based on cross-media knowledge extraction and synthetic question-answer generation, to automatically augment data that provides weak supervision for this task. We evaluate both pipeline-based and end-to-end pretraining-based multimedia QA models on our benchmark, and show that they achieve promising performance while still considerably lagging behind human performance, leaving large room for future work on this challenging new task.