Visual Semantic Embedding (VSE) aims to extract the semantics of images and their descriptions and embed them into the same latent space for cross-modal information retrieval. Most existing VSE networks are trained with a hard negatives loss function that learns an objective margin between the similarities of relevant and irrelevant image-description embedding pairs. However, the objective margin in the hard negatives loss function is set as a fixed hyperparameter, which ignores the semantic differences among irrelevant image-description pairs. To address the challenge of measuring the optimal similarities between image-description pairs before the VSE networks are trained, this paper presents a novel approach comprising two main parts: (1) it finds the underlying semantics of image descriptions; and (2) it proposes a novel semantically enhanced hard negatives loss function, in which the learning objective is dynamically determined based on the optimal similarity scores between irrelevant image-description pairs. Extensive experiments were carried out by integrating the proposed methods into five state-of-the-art VSE networks applied to three benchmark datasets for cross-modal information retrieval tasks. The results show that the proposed methods achieve the best performance and can also be adopted by existing and future VSE networks.
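For context, a minimal sketch of the fixed-margin hard negatives (max-of-hinges) loss referred to above, in the standard VSE++ formulation, is shown below; the symbols $s(\cdot,\cdot)$, $\alpha$, and the pair-dependent margin $\alpha(i, c')$ used to indicate the proposed change are illustrative notation and not taken from the paper itself.

\[
\ell_{\mathrm{MH}}(i, c) \;=\; \max_{c'}\big[\alpha + s(i, c') - s(i, c)\big]_{+} \;+\; \max_{i'}\big[\alpha + s(i', c) - s(i, c)\big]_{+}
\]

Here $(i, c)$ is a relevant image-description pair, $s(\cdot,\cdot)$ is the similarity of the embedded pair, $c'$ and $i'$ range over the irrelevant descriptions and images in the mini-batch, $[x]_{+} = \max(x, 0)$, and $\alpha$ is the fixed margin hyperparameter. The semantically enhanced variant described in the abstract would, in this sketch, replace the constant $\alpha$ with a dynamically determined, pair-dependent value such as $\alpha(i, c')$ derived from the estimated optimal similarity of each irrelevant pair.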