CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations

We propose CLIP-Lite, an information-efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample pair when optimizing its contrastive learning objective. We accomplish this by maximizing an information-efficient lower bound on the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly less data and smaller batch sizes while obtaining better performance than CLIP. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains an absolute gain of +15.4% mAP on Pascal VOC classification and +22.1% top-1 accuracy on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, by performing explicit image-text alignment during representation learning, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks.
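
The abstract does not spell out the lower bound, so the following is only a minimal sketch of how a contrastive objective can work with a single negative pair per positive pair. It assumes a Jensen-Shannon-style mutual-information bound; the dot-product critic, the roll-based negative sampling, and the function names are illustrative choices, not details taken from the source.

```python
import numpy as np


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)


def jsd_lower_bound(image_emb, text_emb):
    """Jensen-Shannon-style MI lower bound with one negative per positive.

    image_emb, text_emb: arrays of shape (batch, dim), batch >= 2.
    Row i of each array is assumed to be a matched image-text pair.
    """
    # Assumed critic: a plain dot product between the two embeddings.
    pos_scores = np.sum(image_emb * text_emb, axis=1)

    # One negative per image: pair image i with the caption of the
    # previous sample in the batch (wrapping around at index 0).
    neg_text = np.roll(text_emb, shift=1, axis=0)
    neg_scores = np.sum(image_emb * neg_text, axis=1)

    # E_pos[-softplus(-T)] - E_neg[softplus(T)]
    return np.mean(-softplus(-pos_scores)) - np.mean(softplus(neg_scores))


def contrastive_loss(image_emb, text_emb):
    # Maximizing the bound is the same as minimizing its negation.
    return -jsd_lower_bound(image_emb, text_emb)
```

Each positive score is contrasted with exactly one negative score, so the cost per batch grows linearly with batch size, whereas an InfoNCE-style loss compares every image against every caption in the batch.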
