Remote sensing (RS) cross-modal text-image retrieval has attracted extensive attention for its advantages of flexible input and efficient query. However, traditional methods ignore the characteristics of multi-scale and redundant targets in RS images, leading to degraded retrieval accuracy. To cope with the problems of multi-scale scarcity and target redundancy in the RS multimodal retrieval task, we propose a novel asymmetric multimodal feature matching network (AMFMN). Our model adapts to multi-scale feature inputs, supports multi-source retrieval, and can dynamically filter redundant features. AMFMN employs a multi-scale visual self-attention (MVSA) module to extract the salient features of an RS image and uses these visual features to guide the text representation. Furthermore, to alleviate the ambiguity of positive samples caused by strong intraclass similarity in RS images, we propose a triplet loss function with a dynamic variable margin based on the prior similarity of sample pairs. Finally, unlike traditional RS image-text datasets with coarse text and high intraclass similarity, we construct a fine-grained and more challenging Remote sensing Image-Text Match dataset (RSITMD), which supports RS image retrieval through keywords and sentences, both separately and jointly. Experiments on four RS text-image datasets demonstrate that the proposed model achieves state-of-the-art performance in the cross-modal RS text-image retrieval task.
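For intuition, a minimal sketch of a triplet loss with a dynamic variable margin is given below; this is an illustrative form consistent with the description above, not the paper's exact formulation, and the symbols $\alpha_0$, $\beta$, $S$, and $\mathrm{sim}$ are assumptions introduced here:

\[
\mathcal{L}(v, t^{+}, t^{-}) = \big[\,\alpha(t^{+}, t^{-}) - S(v, t^{+}) + S(v, t^{-})\,\big]_{+},
\qquad
\alpha(t^{+}, t^{-}) = \alpha_0 \big(1 - \beta\,\mathrm{sim}(t^{+}, t^{-})\big),
\]

where $v$ is the anchor image, $t^{+}$ and $t^{-}$ are the positive and negative texts, $S(\cdot,\cdot)$ is the learned cross-modal similarity, $[x]_{+} = \max(x, 0)$, $\alpha_0$ is a base margin, and $\mathrm{sim}(t^{+}, t^{-})$ is a precomputed prior similarity of the sample pair. Under this sketch, sample pairs with strong prior similarity (the ambiguous cases) receive a smaller margin, softening the penalty when a "negative" is nearly indistinguishable from the positive.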