Advances in visual-language contrastive learning have made it possible for many downstream applications to be carried out efficiently and accurately by simply taking the dot product between image and text representations. One of the most representative approaches proposed recently, known as CLIP, has quickly garnered widespread adoption due to its effectiveness. CLIP is trained with an InfoNCE loss that takes into account both positive and negative samples to help learn a much more robust representation space. This paper reveals, however, that the common downstream practice of taking a dot product is only a zeroth-order approximation of the optimization goal, resulting in a loss of information at test time. Intuitively, since the model has been optimized based on the InfoNCE loss, test-time procedures should ideally also be in alignment with it. The question is how to recover some analogue of the negative-sample information during inference. We propose Distribution Normalization (DN), in which we approximate the mean representation of a batch of test samples and use this mean to represent what would be analogous to negative samples in the InfoNCE loss. DN requires no retraining or fine-tuning and can be effortlessly applied during inference. Extensive experiments on a wide variety of downstream tasks exhibit a clear advantage of DN over the dot product.
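To make the idea concrete, the following is a minimal sketch of one plausible instantiation of Distribution Normalization: the mean embedding of an unlabeled reference batch of test samples is subtracted from each modality's representation before the dot product is taken. The function name, the use of a separate reference batch, and the exact centering scheme are illustrative assumptions, not necessarily the paper's precise formulation.

```python
import numpy as np


def distribution_normalized_scores(img_emb, txt_emb, ref_img_emb, ref_txt_emb):
    """Sketch of Distribution Normalization (DN) at inference time.

    img_emb:      (N, d) L2-normalized image embeddings for the test queries.
    txt_emb:      (M, d) L2-normalized text embeddings for the candidates.
    ref_img_emb,
    ref_txt_emb:  embeddings of an unlabeled batch of test samples, used only
                  to estimate the mean of each modality's distribution.

    Instead of the plain dot product img_emb @ txt_emb.T, DN centers each
    representation by an estimate of its distribution mean, which serves as a
    stand-in for the negative samples appearing in the InfoNCE loss.
    """
    mu_img = ref_img_emb.mean(axis=0, keepdims=True)  # (1, d) mean image embedding
    mu_txt = ref_txt_emb.mean(axis=0, keepdims=True)  # (1, d) mean text embedding
    # Centered representations; the encoders themselves are left untouched,
    # so no retraining or fine-tuning is required.
    return (img_emb - mu_img) @ (txt_emb - mu_txt).T
```

In a retrieval setting, the scores returned by this function would simply replace the raw dot-product similarity matrix; everything upstream (the frozen CLIP encoders and their pre-computed embeddings) stays exactly as before.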