With the development of earth observation technology, massive amounts of remote sensing (RS) images are being acquired. Cross-modal RS image-voice retrieval offers a new way to find useful information in these images, and this paper studies that task. Existing methods for RS image-voice retrieval rely primarily on the pairwise relationship to narrow the heterogeneous semantic gap between images and voices. However, besides the pairwise relationship provided by the datasets, the intra-modality and non-paired inter-modality relationships should also be taken into account, since the semantic consistency among non-paired representations plays an important role in the retrieval task. Motivated by this, a semantics-consistent representation learning (SCRL) method is proposed for RS image-voice retrieval. The main novelty is that the proposed method considers the pairwise, intra-modality, and non-paired inter-modality relationships simultaneously, thereby improving the semantic consistency of the learned representations. The SCRL method consists of two main steps: 1) semantics encoding and 2) semantics-consistent representation learning. First, an image encoding network extracts high-level image features with a transfer learning strategy, and a voice encoding network with dilated convolution is devised to obtain high-level voice features. Second, a consistent representation space is constructed by modeling the three kinds of relationships, which narrows the heterogeneous semantic gap and yields semantics-consistent representations across the two modalities. Extensive experiments on three challenging RS image-voice datasets show the effectiveness of the proposed method.
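As a rough illustration of how the three kinds of relationships might enter a single training objective, the sketch below sums a pairwise term, an intra-modality term, and a non-paired inter-modality term over a small batch. The abstract does not give the loss, so everything here is an assumption made for illustration and not the authors' formulation: the contrastive form, the `margin` value, the equal weighting of the terms, the use of class labels to decide semantic similarity, and the names `consistency_loss` and `contrastive`.

```python
import math


def dist(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive(a, b, same, margin):
    """Pull same-label embeddings together; push others at least `margin` apart."""
    d = dist(a, b)
    return d ** 2 if same else max(0.0, margin - d) ** 2


def consistency_loss(img, voc, labels, margin=1.0):
    """Hypothetical objective; img[i] and voc[i] are assumed to be a paired sample."""
    n = len(labels)
    # Pairwise relationship: each image should sit close to its own voice.
    pairwise = sum(dist(img[i], voc[i]) ** 2 for i in range(n))
    intra = 0.0  # image-image and voice-voice relationships
    inter = 0.0  # image-voice relationships between non-paired samples
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            same = labels[i] == labels[j]
            inter += contrastive(img[i], voc[j], same, margin)
            if i < j:  # count each unordered intra-modality pair once
                intra += contrastive(img[i], img[j], same, margin)
                intra += contrastive(voc[i], voc[j], same, margin)
    return {"pairwise": pairwise, "intra": intra, "inter": inter,
            "total": pairwise + intra + inter}


if __name__ == "__main__":
    images = [[0.0, 0.0], [1.0, 0.0]]
    voices = [[0.0, 0.0], [1.0, 0.0]]
    print(consistency_loss(images, voices, labels=[0, 0]))
```

In this toy batch the paired embeddings coincide, so the pairwise term vanishes, while the two same-label samples still incur intra-modality and non-paired inter-modality penalties. That is the kind of signal a purely pairwise objective would not see.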