Reducing Predictive Feature Suppression in Resource-Constrained Contrastive Image-Caption Retrieval

Source: arXiv. Published in Transactions on Machine Learning Research. OpenReview: https://openreview.net/forum?id=T1XtOqrVKn Code: https://github.com/MauritsBleeker/reducing-predictive-feature-suppression

To train image-caption retrieval (ICR) methods, contrastive loss functions are a common choice for optimization functions. Unfortunately, contrastive ICR methods are vulnerable to predictive feature suppression. Predictive features are features that correctly indicate the similarity between a query and a candidate item. However, in the presence of multiple predictive features during training, encoder models tend to suppress redundant predictive features, since these features are not needed to learn to discriminate between positive and negative pairs. While some predictive features are redundant during training, these features might be relevant during evaluation. We introduce an approach to reduce predictive feature suppression for resource-constrained ICR methods: latent target decoding (LTD). We add an additional decoder to the contrastive ICR framework, to reconstruct the input caption in a latent space of a general-purpose sentence encoder, which prevents the image and caption encoder from suppressing predictive features. We implement the LTD objective as an optimization constraint, to ensure that the reconstruction loss is below a bound value while primarily optimizing for the contrastive loss. Importantly, LTD does not depend on additional training data or expensive (hard) negative mining strategies. Our experiments show that, unlike reconstructing the input caption in the input space, LTD reduces predictive feature suppression, measured by obtaining higher recall@k, r-precision, and nDCG scores than a contrastive ICR baseline. Moreover, we show that LTD should be implemented as an optimization constraint instead of a dual optimization objective. Finally, we show that LTD can be used with different contrastive learning losses and a wide variety of resource-constrained ICR methods.
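To make the constraint formulation concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes an InfoNCE-style contrastive loss, a cosine-distance reconstruction loss in the sentence encoder's latent space, and a Lagrange multiplier that grows while the reconstruction loss exceeds the bound. The function names, the temperature, the step size, and the multiplier update rule are all illustrative assumptions; the abstract only states that the reconstruction loss is kept below a bound while the contrastive loss is the primary objective.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(image_embs, caption_embs, temperature=0.1):
    """Image-to-caption InfoNCE; caption i is the positive for image i."""
    total = 0.0
    for i, img in enumerate(image_embs):
        logits = [cosine(img, cap) / temperature for cap in caption_embs]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(image_embs)


def ltd_loss(decoded, targets):
    """Mean cosine distance between decoder outputs and latent targets.

    Assumption: `targets` stand in for caption embeddings produced by a
    frozen general-purpose sentence encoder.
    """
    return sum(1.0 - cosine(d, t) for d, t in zip(decoded, targets)) / len(decoded)


def constrained_objective(contrastive, reconstruction, multiplier, bound):
    """Lagrangian: contrastive loss plus a weighted constraint violation."""
    return contrastive + multiplier * (reconstruction - bound)


def update_multiplier(multiplier, reconstruction, bound, step=0.1):
    """Assumed update: raise the multiplier while the constraint is violated,
    lower it otherwise, and never let it go negative."""
    return max(0.0, multiplier + step * (reconstruction - bound))
```

With a dual objective the reconstruction weight would be a fixed hyperparameter; in this sketch the weight adapts, so the reconstruction term only dominates while the loss is above the bound.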
