Image token removal is an efficient augmentation strategy for reducing the cost of computing image features. However, this strategy has been found to adversely affect the accuracy of CLIP training. We hypothesize that removing a large portion of image tokens may improperly discard the semantic content associated with a given text description, thus constituting an incorrect pairing target in CLIP training. To address this issue, we propose an attentive token removal approach for CLIP training, which retains the tokens with a high semantic correlation to the text description. The correlation scores are computed in an online fashion using the EMA version of the visual encoder. Our experiments show that the proposed attentive masking approach performs better than the previous method of random token removal for CLIP training. The approach also makes it efficient to apply multiple augmentation views to the image and to introduce instance contrastive learning tasks between these views into the CLIP framework. Compared to other CLIP improvements that combine different pre-training targets, such as SLIP and MaskCLIP, our method is not only more effective but also much more efficient. Specifically, using ViT-B and the YFCC-15M dataset, our approach achieves $43.9\%$ top-1 accuracy on ImageNet-1K zero-shot classification, as well as $62.7/42.1$ and $38.0/23.2$ I2T/T2I retrieval accuracy on Flickr30K and MS COCO, which are $+1.1\%$, $+5.5/+0.9$, and $+4.4/+1.3$ higher than the SLIP method, while being $2.30\times$ faster. An efficient variant of our approach, running $1.16\times$ faster than the plain CLIP model, achieves significant gains of $+5.3\%$, $+11.3/+8.0$, and $+9.5/+4.9$ on these benchmarks.
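To make the core idea concrete, the following is a minimal sketch of attentive token selection in PyTorch. It assumes the per-token relevance scores are the EMA visual encoder's [CLS]-to-patch attention weights (e.g., averaged over heads); the function name, shapes, and `keep_ratio` parameter are illustrative assumptions, not the authors' implementation.

```python
import torch

def attentive_mask(ema_attn: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Select image tokens to keep, given per-token relevance scores.

    ema_attn: (B, N) attention weights from the EMA visual encoder's [CLS]
              token to the N patch tokens (assumed averaged over heads).
    Returns the indices of the top-k most relevant tokens per image.
    """
    num_keep = max(1, int(ema_attn.shape[1] * keep_ratio))
    # Keep the tokens the EMA encoder attends to most strongly,
    # i.e., those most likely to carry the text-relevant semantics.
    return ema_attn.topk(num_keep, dim=1).indices

# Usage sketch: gather the retained patch embeddings before the
# (trainable) visual encoder's forward pass.
B, N, D = 4, 196, 768                          # illustrative batch/token/dim sizes
patch_tokens = torch.randn(B, N, D)            # patch embeddings
scores = torch.rand(B, N)                      # stand-in for EMA [CLS] attention
idx = attentive_mask(scores, keep_ratio=0.5)   # (B, N // 2)
kept = patch_tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, D))
```

Because only the retained tokens enter the trainable encoder, the same scoring pass can cheaply produce several masked views per image, which is what enables the additional instance contrastive task between views.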