Contrastive loss functions are a common choice of optimization objective for training image-caption retrieval (ICR) methods. Unfortunately, contrastive ICR methods are vulnerable to learning shortcuts: decision rules that perform well on the training data but fail to transfer to other testing conditions. We introduce an approach to reduce shortcut feature representations for the ICR task: latent target decoding (LTD). We add an additional decoder to the learning framework to reconstruct the input caption, which prevents the image and caption encoders from learning shortcut features. Instead of reconstructing input captions in the input space, we decode the semantics of the caption in a latent space. We implement the LTD objective as an optimization constraint, ensuring that the reconstruction loss stays below a threshold value while the contrastive loss is primarily optimized. Importantly, LTD does not depend on additional training data or expensive (hard) negative mining strategies. Our experiments show that, unlike reconstructing the caption in the input space, LTD reduces shortcut learning and improves generalizability, yielding higher recall@k and r-precision scores. Additionally, we show that the evaluation scores benefit from implementing LTD as an optimization constraint rather than as a dual loss.
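As a minimal illustration of the constrained objective described above, the sketch below pairs an InfoNCE-style contrastive loss with a latent reconstruction term whose weight is adapted by a Lagrange-multiplier update. This is an assumption-laden sketch, not the authors' implementation: the encoder/decoder names (`image_enc`, `caption_enc`, `decoder`, `target_enc`) and the multiplier update rule (`eta`, `bound`) are hypothetical placeholders for whatever networks and constraint schedule are actually used.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of LTD as a constrained optimization problem.
# All module names and hyperparameters here are illustrative assumptions.

def info_nce(img_emb, cap_emb, tau=0.07):
    """Symmetric InfoNCE contrastive loss over a batch of matched pairs."""
    logits = img_emb @ cap_emb.t() / tau          # (B, B) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

def ltd_step(images, captions, image_enc, caption_enc, decoder,
             target_enc, lam, eta=0.1, bound=0.2):
    """One training step: contrastive loss, with the latent reconstruction
    loss constrained to stay below `bound` via a Lagrangian term."""
    img_emb = F.normalize(image_enc(images), dim=-1)
    cap_emb = F.normalize(caption_enc(captions), dim=-1)

    # Decode the *semantics* of the caption in a latent space:
    # the target is the embedding from a frozen target encoder,
    # not the caption tokens themselves.
    with torch.no_grad():
        latent_target = F.normalize(target_enc(captions), dim=-1)
    recon = F.normalize(decoder(cap_emb), dim=-1)
    recon_loss = (1 - (recon * latent_target).sum(-1)).mean()  # cosine distance

    # Constrained objective: L_contrastive + lam * (recon_loss - bound).
    loss = info_nce(img_emb, cap_emb) + lam * (recon_loss - bound)

    # Multiplier ascent: lam grows while the constraint is violated,
    # shrinks (down to zero) once the reconstruction bound is met.
    with torch.no_grad():
        lam = torch.clamp(lam + eta * (recon_loss - bound), min=0.0)
    return loss, lam

# Usage: initialize the multiplier as a scalar tensor, e.g.
#   lam = torch.tensor(0.0)
# and carry the returned `lam` across training steps.
```

The design point this sketch tries to capture is that the reconstruction term acts as a constraint rather than a second loss to be minimized: once the latent reconstruction loss falls below the threshold, its multiplier decays toward zero and optimization is driven by the contrastive loss alone.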