CLIP yielded impressive results on zero-shot transfer learning tasks and is considered a foundation model like BERT or GPT-3. CLIP vision models, which have rich representations, are pre-trained using the InfoNCE objective and natural language supervision before they are fine-tuned on particular tasks. Though CLIP excels at zero-shot transfer learning, it suffers from an explaining away problem, that is, it focuses on one or a few features while neglecting other relevant features. This problem is caused by insufficiently extracting the covariance structure of the original multi-modal data. We suggest using modern Hopfield networks to tackle the problem of explaining away. Their retrieved embeddings have an enriched covariance structure derived from co-occurrences of features in the stored embeddings. However, modern Hopfield networks increase the saturation effect of the InfoNCE objective, which hampers learning. We propose to use the InfoLOOB objective to mitigate this saturation effect. We introduce the novel "Contrastive Leave One Out Boost" (CLOOB), which uses modern Hopfield networks for covariance enrichment together with the InfoLOOB objective. In experiments, we compare CLOOB to CLIP after pre-training on the Conceptual Captions and YFCC datasets with respect to their zero-shot transfer learning performance on other datasets. CLOOB consistently outperforms CLIP at zero-shot transfer learning across all considered architectures and datasets.
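Since the argument turns on the difference between the InfoNCE and InfoLOOB objectives and on Hopfield retrieval, a minimal PyTorch sketch of the three pieces may help. It is an illustration under assumptions, not the paper's reference implementation: the function names and the values of the temperature `tau` and inverse temperature `beta` are hypothetical choices.

```python
import torch
import torch.nn.functional as F

def info_nce(x, y, tau=0.07):
    """InfoNCE: the positive pair sits in both numerator and denominator,
    so the loss saturates once the positive similarity dominates."""
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    sims = (x @ y.t()) / tau                        # N x N pairwise similarities
    labels = torch.arange(x.size(0), device=x.device)
    return F.cross_entropy(sims, labels)

def info_loob(x, y, tau=0.07):
    """InfoLOOB (leave-one-out bound): the positive pair is excluded from
    the denominator, which mitigates the saturation effect."""
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    sims = (x @ y.t()) / tau
    pos = sims.diagonal()                           # matched image-text pairs
    mask = torch.eye(x.size(0), dtype=torch.bool, device=x.device)
    neg = sims.masked_fill(mask, float('-inf'))     # drop positives from the sum
    return (-pos + torch.logsumexp(neg, dim=-1)).mean()

def hopfield_retrieval(state, stored, beta=8.0):
    """One modern-Hopfield retrieval step: each query is replaced by a
    softmax-weighted average of the stored embeddings, so the result picks
    up the covariance structure of features that co-occur in the stored set."""
    attn = F.softmax(beta * (state @ stored.t()), dim=-1)
    return F.normalize(attn @ stored, dim=-1)
```

In a CLOOB-style training step, both image and text embeddings would first be passed through `hopfield_retrieval` against a stored set of embeddings before a symmetric InfoLOOB loss is computed over both modality directions.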