Exploring fine-grained relationships between entities (e.g., objects in an image or words in a sentence) contributes greatly to understanding multimedia content precisely. Previous attention mechanisms for image-text matching either take multiple self-attention steps to gather correspondences or use image objects (or words) as context to infer image-text similarity. However, they exploit only semantic information, ignoring the fact that objects' relative positions also contribute to image understanding. To this end, we introduce a novel position-aware relation module that models semantic and spatial relationships simultaneously for image-text matching. Given an image, our method uses the locations of different objects to capture spatial relationships. Combining semantic and spatial relationships makes it easier to understand the content of each modality (images and sentences) and to capture fine-grained latent correspondences of image-text pairs. In addition, we employ a two-step aggregated relation module to capture interpretable alignments of image-text pairs. The first step, the intra-modal relation mechanism, computes responses between different objects in an image or different words in a sentence separately; the second step, the inter-modal relation mechanism, uses the query as textual context to refine the relationships among object proposals in an image. In this way, our position-aware aggregated relation network (ParNet) not only learns which entities are relevant by attending to different objects (words) adaptively, but also adjusts the inter-modal correspondence according to the query's content. Our approach achieves state-of-the-art results on the MS-COCO dataset.
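The core idea of combining semantic and spatial relationships can be sketched as a single attention step over region features and their bounding boxes. This is a minimal illustrative sketch, not the paper's exact formulation: the function name, the center-distance spatial term, and the additive combination of the two affinities are assumptions introduced here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def position_aware_relation(feats, boxes):
    """Refine object features by attending over both semantic similarity
    and relative spatial layout (illustrative sketch, not ParNet itself).

    feats: (n, d) region appearance features
    boxes: (n, 4) bounding boxes as (x1, y1, x2, y2)
    """
    n, d = feats.shape
    # Semantic affinity: scaled dot product between object features.
    sem = feats @ feats.T / np.sqrt(d)
    # Spatial affinity from box centers: nearby objects get higher weight
    # (a hypothetical choice standing in for the paper's position encoding).
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    dist = np.sqrt((cx[:, None] - cx[None, :]) ** 2
                   + (cy[:, None] - cy[None, :]) ** 2)
    spa = -np.log1p(dist)
    # Combine the two relation terms, normalize per query object,
    # and aggregate the features of related objects.
    attn = softmax(sem + spa, axis=-1)
    return attn @ feats
```

In the full two-step module, a refinement of this kind would first run within each modality (intra-modal), and the resulting region features would then be re-weighted using the sentence query as context (inter-modal).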