In this paper, we study the effect of a novel regularization scheme on contrastive language-image pre-trained (CLIP) models. Our approach is based on the observation that, in many domains, text tokens should describe only a small number of image regions and, likewise, each image region should correspond to only a few text tokens. In CLIP-style models, this implies that text-token embeddings should have high similarity to only a small number of image-patch embeddings for a given image-text pair. We formalize this observation using a novel regularization scheme that penalizes the entropy of the text-token-to-image-patch similarity scores. We demonstrate qualitatively and quantitatively that the proposed regularization scheme shrinks the text-token-to-image-patch similarity scores towards zero, thus achieving the desired effect. We demonstrate the promise of our approach in an important medical context where this underlying hypothesis naturally arises. Using our proposed approach, we achieve state-of-the-art (SOTA) zero-shot performance on all tasks from the CheXpert chest X-ray dataset, outperforming an unregularized version of the model and several recently published self-supervised models.
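To make the penalty concrete, the following is a minimal sketch of one plausible instantiation: per text token, normalize the similarities to the image patches into a distribution and penalize its Shannon entropy. The temperature `tau`, the softmax normalization over patches, the weight `lam`, and all function and variable names are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def entropy_penalty(text_tokens: torch.Tensor, image_patches: torch.Tensor,
                    tau: float = 0.07) -> torch.Tensor:
    """Mean per-token entropy of the distribution over image patches.

    text_tokens:   (T, d) text-token embeddings for one caption.
    image_patches: (P, d) image-patch embeddings for the paired image.
    Both are assumed L2-normalized, as in CLIP; tau is an assumed temperature.
    """
    sims = text_tokens @ image_patches.T / tau    # (T, P) similarity logits
    log_p = F.log_softmax(sims, dim=-1)           # per-token distribution over patches
    entropy = -(log_p.exp() * log_p).sum(dim=-1)  # Shannon entropy per text token
    return entropy.mean()

# Assumed usage: added to the standard contrastive objective with weight `lam`:
# loss = clip_loss + lam * entropy_penalty(text_tokens, image_patches)
```

Minimizing this term encourages each text token to attend sharply to a few patches, which is one way the sparse token-to-patch correspondence described above could be enforced.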