There are two popular loss functions for vision-language retrieval, the triplet loss and the contrastive learning loss; both essentially minimize the difference between the similarities of negative pairs and positive pairs. More specifically, the Triplet loss with Hard Negative mining (Triplet-HN), which is widely used in existing retrieval models to improve discriminative ability, easily falls into local minima during training. On the other hand, the Vision-Language Contrastive learning loss (VLC), which is widely used in vision-language pre-training, has been shown to achieve significant performance gains in vision-language retrieval, but fine-tuning with VLC on small datasets yields unsatisfactory performance. This paper proposes a unified loss of pair similarity optimization for vision-language retrieval, providing a powerful tool for understanding existing loss functions. Our unified loss incorporates the hard sample mining strategy of VLC and introduces the margin used by the triplet loss for better similarity separation. We show that both Triplet-HN and VLC are special forms of our unified loss. Compared with Triplet-HN, our unified loss converges faster. Compared with VLC, our unified loss is more discriminative and generalizes better in downstream fine-tuning tasks. Experiments on image-text and video-text retrieval benchmarks show that our unified loss can significantly improve the performance of state-of-the-art retrieval models.
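For concreteness, below is a minimal PyTorch-style sketch of the two baseline losses the abstract discusses, assuming their standard in-batch formulations: Triplet-HN with a margin and hardest in-batch negatives, and VLC as a symmetric InfoNCE over an image-text similarity matrix with a temperature. The function names, the (N, N) similarity-matrix setup, and the default hyperparameters are illustrative assumptions, not the paper's implementation, and the proposed unified loss itself is not reproduced here.

```python
import torch
import torch.nn.functional as F

def triplet_hn_loss(sim: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """Triplet loss with Hard Negative mining (Triplet-HN), standard form.

    sim: (N, N) image-text similarity matrix whose diagonal holds the
    positive pairs, e.g. sim = img_emb @ txt_emb.t() after L2-normalization.
    Sketch only; hyperparameters are assumptions.
    """
    pos = sim.diag()                                      # s(v_i, t_i), positive similarities
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float('-inf'))            # exclude positives from mining
    hard_i2t = neg.max(dim=1).values                      # hardest text negative per image
    hard_t2i = neg.max(dim=0).values                      # hardest image negative per text
    loss = F.relu(margin + hard_i2t - pos) + F.relu(margin + hard_t2i - pos)
    return loss.mean()

def vlc_loss(sim: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Vision-Language Contrastive loss (VLC) as symmetric InfoNCE,
    the standard pre-training objective referenced in the abstract."""
    logits = sim / temperature
    targets = torch.arange(sim.size(0), device=sim.device)
    return 0.5 * (F.cross_entropy(logits, targets)        # image -> text direction
                  + F.cross_entropy(logits.t(), targets)) # text -> image direction
```

Both objectives push negative-pair similarities below positive-pair similarities, which is the shared pair-similarity structure that the proposed unified loss generalizes: Triplet-HN separates pairs via a hard margin on the single hardest negative, while VLC softly weights all in-batch negatives through the softmax.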