Visual-Semantic Embedding (VSE) aims to learn an embedding space in which related visual and semantic instances are close to each other. Recent VSE models tend to design complex structures to pool visual and semantic features into fixed-length vectors and use the hard triplet loss for optimization. However, we find that: (1) a combination of simple pooling methods performs no worse than these sophisticated methods; and (2) considering only the hardest-to-distinguish negative sample leads to slow convergence and poor Recall@K improvement. To this end, we propose an adaptive pooling strategy that allows the model to learn how to aggregate features through a combination of simple pooling methods. We also introduce a strategy that dynamically selects a group of negative samples, which makes optimization converge faster and perform better. Experimental results on Flickr30K and MS-COCO demonstrate that a standard VSE equipped with our pooling and optimization strategies outperforms current state-of-the-art systems (by at least 1.0% on recall metrics) in image-to-text and text-to-image retrieval. The source code of our experiments is available at https://github.com/96-Zachary/vse_2ad.
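To make the adaptive pooling idea concrete, the following is a minimal sketch of one way to learn a combination of simple pooling operators (mean and max) with a softmax-weighted mixture. The abstract only states that simple pooling methods are combined with learnable weights; the specific operators and the convex-combination formulation here are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AdaptivePooling(nn.Module):
    """Sketch: aggregate a set of feature vectors via a learned
    convex combination of simple pooling operators (mean, max)."""

    def __init__(self):
        super().__init__()
        # one learnable logit per pooling operator (mean, max) -- an assumption
        self.logits = nn.Parameter(torch.zeros(2))

    def forward(self, features):
        # features: (batch, num_items, dim), e.g. region or token features
        mean_pool = features.mean(dim=1)
        max_pool = features.max(dim=1).values
        w = torch.softmax(self.logits, dim=0)      # weights sum to 1
        return w[0] * mean_pool + w[1] * max_pool  # (batch, dim)
```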
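The negative-selection strategy can likewise be sketched as a variant of the in-batch hard triplet loss that averages over the k hardest negatives rather than back-propagating through only the single hardest one. The value of k, the averaging scheme, and the use of only the image-to-text direction below are illustrative assumptions, not the paper's exact rule.

```python
import torch
import torch.nn.functional as F

def group_hard_triplet_loss(img_emb, txt_emb, margin=0.2, k=5):
    """Sketch: hinge loss averaged over the k hardest in-batch negatives
    (image-to-text direction only; a symmetric term would be added in practice)."""
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    scores = img_emb @ txt_emb.t()                 # (batch, batch) cosine similarities
    pos = scores.diag().unsqueeze(1)               # positive-pair scores

    # hinge cost for every negative caption of each image
    cost = (margin + scores - pos).clamp(min=0)
    cost.fill_diagonal_(0)                         # ignore the positive pair

    # keep the k largest (hardest) violations per image and average them
    hardest, _ = cost.topk(min(k, cost.size(1) - 1), dim=1)
    return hardest.mean()
```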