CLIP yielded impressive results on zero-shot transfer learning tasks and is considered a foundation model like BERT or GPT-3. CLIP vision models, which have rich representations, are pre-trained using the InfoNCE objective and natural language supervision before they are fine-tuned on particular tasks. Though CLIP excels at zero-shot transfer learning, it suffers from an explaining away problem, that is, it focuses on one or a few features while neglecting other relevant features. This problem is caused by insufficient extraction of the covariance structure of the original multi-modal data. We propose using modern Hopfield networks to tackle the problem of explaining away. Their retrieved embeddings have an enriched covariance structure derived from co-occurrences of features in the stored embeddings. However, modern Hopfield networks increase the saturation effect of the InfoNCE objective, which hampers learning. We propose the InfoLOOB objective to mitigate this saturation effect. We introduce the novel ``Contrastive Leave One Out Boost'' (CLOOB), which uses modern Hopfield networks for covariance enrichment together with the InfoLOOB objective. In experiments we compare CLOOB to CLIP after pre-training on the Conceptual Captions and YFCC datasets with respect to their zero-shot transfer learning performance on other datasets. CLOOB consistently outperforms CLIP at zero-shot transfer learning across all considered architectures and datasets.
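The difference between the two contrastive objectives can be illustrated with a minimal NumPy sketch. It assumes a batch similarity matrix `sim` with matching image–text pairs on the diagonal; the function names and the temperature `tau` are illustrative, not the paper's implementation. InfoNCE keeps the positive pair in the denominator, so the per-sample loss is bounded below by zero and saturates; InfoLOOB leaves the positive pair out of the denominator, which mitigates that saturation.

```python
import numpy as np

def info_nce(sim, tau=0.3):
    # InfoNCE: the positive (diagonal) term appears in both the numerator
    # and the denominator, so the loss saturates near zero once the
    # positive pair dominates the batch.
    logits = sim / tau
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def info_loob(sim, tau=0.3):
    # InfoLOOB ("leave one out bound"): the positive pair is excluded
    # from the denominator, so the loss keeps providing gradient signal
    # even when the positive pair already dominates.
    logits = sim / tau
    n = sim.shape[0]
    pos = np.diag(logits)
    mask = ~np.eye(n, dtype=bool)          # leave the positive out
    neg_only = np.where(mask, logits, -np.inf)  # exp(-inf) = 0
    log_neg = np.log(np.exp(neg_only).sum(axis=1))
    return -np.mean(pos - log_neg)
```

Because the InfoLOOB denominator is strictly smaller than the InfoNCE denominator, the InfoLOOB loss is always below the InfoNCE loss on the same batch and is not bounded below by zero.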