Cross-modal information retrieval aims to find heterogeneous data of different modalities from a query of one modality. The main challenge is to map the different modalities into a common semantic space, in which the distance between concepts from different modalities can be well modeled. For cross-modal information retrieval between images and texts, existing work mostly uses off-the-shelf Convolutional Neural Networks (CNNs) for image feature extraction. For texts, word-level features such as bag-of-words or word2vec are employed to build deep learning models that represent texts. Besides word-level semantics, the semantic relations between words are also informative but less explored. In this paper, we model texts as graphs using a similarity measure based on word2vec. A dual-path neural network model is proposed for coupled feature learning in cross-modal information retrieval. One path uses a Graph Convolutional Network (GCN) to model texts based on their graph representations. The other path uses a neural network with layers of nonlinearities to model images based on off-the-shelf features. The model is trained with a pairwise similarity loss function that maximizes the similarity of relevant text-image pairs and minimizes the similarity of irrelevant pairs. Experimental results show that the proposed model significantly outperforms the state-of-the-art methods, with a 17% improvement in accuracy in the best case.
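To make the architecture concrete, the following is a minimal PyTorch sketch of the dual-path design: a word graph built from word2vec cosine similarity feeds a small GCN on the text path, off-the-shelf image features pass through fully connected layers on the image path, and a pairwise similarity loss pulls relevant text-image pairs together while pushing irrelevant pairs apart. The graph construction rule, layer sizes, pooling, and the exact loss formulation are illustrative assumptions rather than the paper's precise implementation.

```python
# A minimal PyTorch sketch of the dual-path model described above.
# The graph construction, layer sizes, pooling, and exact loss form are
# illustrative assumptions, not the paper's precise implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def build_word_graph(word_vectors, k=8):
    """Connect each word to its k most similar words by word2vec cosine
    similarity and return a symmetrically normalized adjacency matrix
    (one plausible construction for the word graph)."""
    normed = F.normalize(word_vectors, dim=-1)
    sims = normed @ normed.T
    topk = sims.topk(k + 1, dim=-1).indices            # +1: each word is most similar to itself
    adj = torch.zeros_like(sims).scatter_(1, topk, 1.0)
    adj = ((adj + adj.T) > 0).float()                  # symmetrize
    d = adj.sum(dim=-1).clamp(min=1).pow(-0.5)         # D^{-1/2} A D^{-1/2} normalization
    return d.unsqueeze(1) * adj * d.unsqueeze(0)


class GCNLayer(nn.Module):
    """One graph convolution: H' = ReLU(A_hat H W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (num_words, in_dim), adj: (num_words, num_words)
        return F.relu(adj @ self.linear(x))


class DualPathModel(nn.Module):
    """Text path: word graph -> stacked GCN layers -> pooled embedding.
    Image path: off-the-shelf CNN features -> fully connected layers."""
    def __init__(self, word_dim=300, img_dim=4096, embed_dim=1024):
        super().__init__()
        self.gcn1 = GCNLayer(word_dim, embed_dim)
        self.gcn2 = GCNLayer(embed_dim, embed_dim)
        self.img_fc = nn.Sequential(
            nn.Linear(img_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, word_feats, adj, img_feats):
        h = self.gcn2(self.gcn1(word_feats, adj), adj)  # (num_words, embed_dim)
        text_emb = h.mean(dim=0, keepdim=True)          # pool word nodes -> (1, embed_dim)
        img_emb = self.img_fc(img_feats)                # (1, embed_dim)
        return F.normalize(text_emb, dim=-1), F.normalize(img_emb, dim=-1)


def pairwise_similarity_loss(text_emb, img_emb, relevant, margin=0.5):
    """Contrastive-style loss on cosine similarity: pull relevant text-image
    pairs together, push irrelevant pairs below the margin."""
    sim = (text_emb * img_emb).sum(dim=-1)              # cosine (embeddings are L2-normalized)
    return (relevant * (1.0 - sim) + (1.0 - relevant) * F.relu(sim - margin)).mean()
```

Mean-pooling over word nodes and the symmetrically normalized adjacency are common GCN choices used here for brevity; the paper's pooling strategy and loss may differ in detail.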