Most existing work that grounds natural language phrases in images starts with the assumption that the phrase in question is relevant to the image. In this paper we address a more realistic version of the natural language grounding task where we must both identify whether the phrase is relevant to an image and localize the phrase. This can also be viewed as a generalization of object detection to an open-ended vocabulary, introducing elements of few- and zero-shot detection. We propose an approach for this task that extends Faster R-CNN to relate image regions and phrases. By carefully initializing the classification layers of our network using canonical correlation analysis (CCA), we encourage a solution that is more discerning when reasoning between similar phrases, resulting in over double the performance compared to a naive adaptation on three popular phrase grounding datasets, Flickr30K Entities, ReferIt Game, and Visual Genome, with test-time phrase vocabulary sizes of 5K, 32K, and 159K, respectively.
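To make the CCA-based initialization concrete, below is a minimal sketch (not the authors' released code) of how a phrase classification layer could be seeded from a canonical correlation analysis fit on paired region features and phrase embeddings. The function and variable names (`cca_init_classifier`, `region_feats`, `phrase_embs`) are illustrative placeholders, and the exact projection and normalization choices are assumptions rather than the paper's specification.

```python
# Illustrative sketch: initialize a phrase classification layer from CCA.
# Assumes precomputed region features (N x D_img) and phrase embeddings
# (V x D_txt) for paired training examples; all names are hypothetical.
import torch
import torch.nn as nn
from sklearn.cross_decomposition import CCA


def cca_init_classifier(region_feats, phrase_embs, n_components=128):
    """Fit CCA on paired (region feature, phrase embedding) samples and
    return (1) a linear projection of region features into the shared CCA
    space and (2) a classifier whose rows are the CCA-projected phrase
    embeddings, so initial scores behave like similarities in that space."""
    # n_components must not exceed min(D_img, D_txt, num_pairs).
    cca = CCA(n_components=n_components, max_iter=500)
    cca.fit(region_feats, phrase_embs)

    # Rotation matrices mapping each view into the joint space.
    W_img = torch.tensor(cca.x_rotations_, dtype=torch.float32)  # (D_img, k)
    W_txt = torch.tensor(cca.y_rotations_, dtype=torch.float32)  # (D_txt, k)

    # Image branch: project region features into the CCA space.
    img_proj = nn.Linear(region_feats.shape[1], n_components, bias=False)
    img_proj.weight.data.copy_(W_img.t())

    # Classification layer: one row per phrase in the vocabulary, initialized
    # with that phrase's L2-normalized embedding in the CCA space.
    phrase_in_cca = torch.tensor(phrase_embs, dtype=torch.float32) @ W_txt
    phrase_in_cca = nn.functional.normalize(phrase_in_cca, dim=1)
    classifier = nn.Linear(n_components, phrase_embs.shape[0], bias=False)
    classifier.weight.data.copy_(phrase_in_cca)
    return img_proj, classifier
```

In this sketch, both layers remain trainable, so the CCA solution only provides a starting point that already separates similar phrases before end-to-end fine-tuning of the detection network.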