Image-sentence retrieval has attracted extensive research attention in multimedia and computer vision due to its promising applications. The key issue lies in jointly learning visual and textual representations to accurately estimate their similarity. To this end, the mainstream scheme adopts object-word based attention to calculate relevance scores and refines the interactive representations with the attention features; however, it neglects the context that object representations gain from inter-object relationships, which match the predicates in sentences. In this paper, we propose a Cross-modal Semantic Enhanced Interaction method, termed CMSEI, for image-sentence retrieval, which correlates the intra- and inter-modal semantics between objects and words. In particular, we first design intra-modal spatial and semantic graph-based reasoning to enhance the semantic representations of objects, guided by the explicit relationships encoded in the objects' spatial positions and their scene graph. The visual and textual semantic representations are then refined jointly via inter-modal interactive attention and cross-modal alignment. To correlate the context of objects with the textual context, we further refine the visual semantic representations via cross-level object-sentence and word-image interactive attention. Experimental results on seven standard evaluation metrics show that the proposed CMSEI outperforms state-of-the-art and alternative approaches on the MS-COCO and Flickr30K benchmarks.
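To make the "mainstream scheme" concrete, below is a minimal PyTorch sketch of SCAN-style object-word attention: every word attends over image regions, and the attended visual contexts are pooled into an image-sentence similarity. The tensor shapes, the `smooth` temperature, and the mean pooling are illustrative assumptions for exposition, not CMSEI's actual implementation.

```python
import torch
import torch.nn.functional as F

def object_word_attention(V, W, smooth=9.0):
    """SCAN-style object-word attention sketch.

    V: (n_regions, d) image region features
    W: (n_words, d)   word features
    Returns a scalar image-sentence similarity.
    """
    # Cosine relevance between every word and every region.
    Vn = F.normalize(V, dim=-1)
    Wn = F.normalize(W, dim=-1)
    rel = Wn @ Vn.t()                       # (n_words, n_regions)

    # Each word attends over regions; `smooth` is the softmax temperature.
    attn = F.softmax(smooth * rel, dim=-1)  # (n_words, n_regions)
    ctx = attn @ V                          # (n_words, d) attended visual context

    # Word-level similarities, pooled to a sentence-level score.
    sim = F.cosine_similarity(W, ctx, dim=-1)  # (n_words,)
    return sim.mean()

# Usage: 36 detected regions and 12 words, both projected to d=1024.
V = torch.randn(36, 1024)
W = torch.randn(12, 1024)
print(object_word_attention(V, W).item())
```

Note that this baseline scores each region independently; CMSEI's graph-based reasoning is motivated precisely by enriching `V` with spatial and scene-graph relations before such attention is applied.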