Over the past two years, vision-language pre-training has achieved noteworthy success on several downstream tasks. Nevertheless, acquiring high-quality image-text pairs, where the pairs are entirely exclusive of each other, remains a challenging task, and noise exists in the commonly used datasets. To address this issue, we propose SoftCLIP, a novel approach that relaxes the strict one-to-one constraint and achieves a soft cross-modal alignment by introducing a softened target, which is generated from the fine-grained intra-modal self-similarity. The intra-modal guidance is indicative, allowing two pairs to share some local similarities and enabling the model to capture many-to-many relationships between the two modalities. Besides, since the positive still dominates the softened target distribution, we disentangle the negatives in the distribution to further boost the relation alignment with the negatives in cross-modal learning. Extensive experiments demonstrate the effectiveness of SoftCLIP. In particular, on the ImageNet zero-shot classification task, using CC3M/CC12M as the pre-training dataset, SoftCLIP brings a top-1 accuracy improvement of 6.8%/7.2% over the CLIP baseline.
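To make the idea of a softened cross-modal target concrete, the following is a minimal NumPy sketch, not the authors' implementation: it replaces CLIP's one-hot target with a mixture of the identity target and a distribution derived from intra-modal self-similarity, then takes the cross-entropy against the cross-modal logits. The function name, the mixing weight `alpha`, and the use of image-image similarity as the intra-modal guidance are illustrative assumptions.

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable row-wise log-softmax."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def softmax(x: np.ndarray) -> np.ndarray:
    """Row-wise softmax via log_softmax for stability."""
    return np.exp(log_softmax(x))

def soft_alignment_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
                        tau: float = 0.07, alpha: float = 0.5) -> float:
    """Illustrative soft image-to-text contrastive loss (assumed form).

    img_emb, txt_emb: (N, D) paired embeddings for a batch of N pairs.
    tau:   temperature for both cross-modal and intra-modal similarities.
    alpha: weight of the softened (intra-modal) target vs. the one-hot target.
    """
    # L2-normalize so dot products are cosine similarities, as in CLIP.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)

    # Cross-modal logits: row i scores image i against every text.
    logits = img @ txt.T / tau

    # Intra-modal self-similarity of images, softened into a distribution.
    # Off-diagonal mass lets pair i assign credit to similar pairs j != i,
    # modelling many-to-many relations instead of a strict one-to-one match.
    soft_target = softmax(img @ img.T / tau)

    # Mix the hard one-hot target with the softened intra-modal target.
    target = (1.0 - alpha) * np.eye(len(img)) + alpha * soft_target

    # Cross-entropy between the softened target and cross-modal predictions.
    return float(-np.mean(np.sum(target * log_softmax(logits), axis=1)))
```

With `alpha = 0` this reduces to the standard one-hot image-to-text contrastive loss; larger `alpha` shifts credit toward pairs whose images are intra-modally similar. A symmetric text-to-image term and the negative-disentangling step described in the abstract are omitted here for brevity.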