We present a task and benchmark dataset for person-centric visual grounding: the problem of linking people named in a caption to people pictured in an image. In contrast to prior work in visual grounding, which is predominantly object-based, our new task masks out the names of people in captions in order to encourage methods trained on such image-caption pairs to focus on contextual cues (such as rich interactions between multiple people), rather than learning associations between names and appearances. To facilitate this task, we introduce a new dataset, Who's Waldo, mined automatically from image-caption data on Wikimedia Commons. We propose a Transformer-based method that outperforms several strong baselines on this task, and we are releasing our data to the research community to spur work on contextual models that consider both vision and language.