Referring expression comprehension aims to localize objects identified by natural language descriptions. This is a challenging task, as it requires understanding both the visual and language domains. One key property is that each object can be described by multiple synonymous sentences, i.e., paraphrases, and such variety in language has a critical impact on learning a comprehension model. While prior work usually treats each sentence separately and attends it to an object, we focus on learning a referring expression comprehension model that accounts for this property of synonymous sentences. To this end, we develop an end-to-end trainable framework that learns contrastive features at the image and object-instance levels, such that features extracted from synonymous sentences describing the same object are closer to each other after being mapped to the visual domain. We conduct extensive experiments on several benchmark datasets and demonstrate that our method performs favorably against state-of-the-art approaches. Furthermore, since the variety in expressions becomes larger across datasets, where objects are described in different ways, we present cross-dataset and transfer learning settings to validate the transferability of our learned features.
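The contrastive objective described above can be illustrated with a minimal sketch. This is not the paper's actual loss; it is a generic InfoNCE-style formulation, assuming each anchor embedding (one paraphrase of an object, mapped to the visual domain) is paired with a positive (a synonymous sentence for the same object), with the other objects in the batch serving as negatives.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE-style contrastive loss (illustrative, not the
    paper's exact objective): embeddings of synonymous sentences for
    the same object (diagonal pairs) should score higher than
    embeddings describing other objects (off-diagonal pairs)."""
    # L2-normalize so dot products become cosine similarities.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature  # (N, N) similarity matrix
    # Log-softmax over each row; diagonal entries are the positives.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))
```

When synonymous-sentence embeddings for the same object nearly coincide and differ from those of other objects, the loss approaches zero; mismatched or random embeddings yield a higher loss, which is the gradient signal that pulls paraphrase features together.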