Most image-text retrieval work adopts binary labels that indicate whether an image-text pair matches. Such a binary indicator covers only a limited subset of image-text semantic relations and is insufficient to represent the relevance degrees between images and texts described by continuous labels such as image captions. The visual-semantic embedding space learned from binary labels is therefore incoherent and cannot fully characterize these relevance degrees. In addition to binary labels, this paper incorporates continuous pseudo labels (generally approximated by the text similarity between captions) to indicate the relevance degrees. To learn a coherent embedding space, we propose an image-text retrieval framework with Binary and Continuous Label Supervision (BCLS), in which binary labels guide the retrieval model to learn limited binary correlations and continuous labels complement the learning of image-text semantic relations. For learning from binary labels, we augment the common triplet ranking loss with Soft Negative mining (Triplet-SN) to improve convergence. For learning from continuous labels, we design a Kendall ranking loss, inspired by the Kendall rank correlation coefficient (Kendall), which improves the correlation between the similarity scores predicted by the retrieval model and the continuous labels. To mitigate the noise introduced by the continuous pseudo labels, we further design a Sliding Window sampling and Hard Sample mining strategy (SW-HS), which also reduces the complexity of our framework to the same order of magnitude as the triplet ranking loss. Extensive experiments on two image-text retrieval benchmarks demonstrate that our method improves the performance of state-of-the-art image-text retrieval models.
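As an illustration of the quantity that motivates the Kendall ranking loss, the following minimal sketch computes the classical Kendall rank correlation coefficient (tau-a) between predicted similarity scores and continuous pseudo labels. It is not the loss proposed in this paper, which must be differentiable and is combined with SW-HS sampling; the function name `kendall_tau` and the sample values are hypothetical and chosen only for demonstration.

```python
from itertools import combinations


def kendall_tau(scores, labels):
    """Classical Kendall tau-a: (concordant - discordant) / number of pairs.

    Illustrative only; tied pairs count as neither concordant nor discordant.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    n = len(scores)
    if n < 2:
        raise ValueError("at least two items are required")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both sequences order items i and j the same way.
        agreement = (scores[i] - scores[j]) * (labels[i] - labels[j])
        if agreement > 0:
            concordant += 1
        elif agreement < 0:
            discordant += 1

    total_pairs = n * (n - 1) / 2
    return (concordant - discordant) / total_pairs


# Hypothetical values: model similarity scores of one image against four
# captions, and continuous pseudo labels for the same four captions.
predicted_scores = [0.9, 0.2, 0.6, 0.4]
pseudo_labels = [1.0, 0.1, 0.3, 0.5]

tau = kendall_tau(predicted_scores, pseudo_labels)
print(f"Kendall tau-a: {tau:.4f}")
```

A value of 1 means the predicted scores rank the candidates exactly as the pseudo labels do, and -1 means the ranking is fully reversed. Enumerating all pairs costs a number of comparisons quadratic in the number of candidates, which is why a sampling strategy that restricts the pairs considered can lower the cost.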