Despite the achievements of large-scale multimodal pre-training approaches, cross-modal retrieval, e.g., image-text retrieval, remains a challenging task. To bridge the semantic gap between the two modalities, previous studies mainly focus on word-region alignment at the object level, lacking a matching between the linguistic relations among words and the visual relations among regions. Neglecting such relation consistency impairs the contextualized representation of image-text pairs and hinders model performance and interpretability. In this paper, we first propose a novel metric, Intra-modal Self-attention Distance (ISD), which quantifies relation consistency by measuring the semantic distance between linguistic and visual relations. In response, we present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method that optimizes the ISD and mutually calibrates the intra-modal self-attentions of the two modalities via inter-modal alignment. The IAIS regularizer improves the performance of prevailing models on the Flickr30k and MS COCO datasets by a considerable margin, demonstrating the superiority of our approach.
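To make the idea concrete, below is a minimal PyTorch sketch of how a relation-consistency regularizer in this spirit could be computed: the image-region self-attention map is projected into the text token space via an inter-modal alignment matrix and compared against the text self-attention map with a divergence. The function names, the projection scheme, and the choice of KL divergence are illustrative assumptions, not the paper's exact ISD/IAIS formulation.

```python
import torch
import torch.nn.functional as F


def relation_distance(attn_p: torch.Tensor, attn_q: torch.Tensor) -> torch.Tensor:
    """KL(P || Q) between two row-stochastic attention maps of the same shape
    (an illustrative choice of semantic distance between relations)."""
    return F.kl_div(attn_q.clamp_min(1e-8).log(), attn_p, reduction="batchmean")


def relation_consistency_loss(text_self_attn: torch.Tensor,
                              image_self_attn: torch.Tensor,
                              text2image_align: torch.Tensor) -> torch.Tensor:
    """Hypothetical relation-consistency regularizer.

    text_self_attn:   (B, Lt, Lt) row-stochastic text self-attention
    image_self_attn:  (B, Lv, Lv) row-stochastic image-region self-attention
    text2image_align: (B, Lt, Lv) inter-modal alignment, rows summing to 1
    """
    # Map the visual relations into the text token index space.
    projected = text2image_align @ image_self_attn @ text2image_align.transpose(1, 2)
    # Re-normalize rows so the projected map is again row-stochastic.
    projected = projected / projected.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    # Penalize disagreement between linguistic and (projected) visual relations.
    return relation_distance(text_self_attn, projected)
```

In training, such a term would simply be added to the standard retrieval objective, e.g. `loss = retrieval_loss + lambda_isd * relation_consistency_loss(...)`, with `lambda_isd` a weighting hyperparameter (again an assumption for illustration).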