The Contrastive Language-Image Pre-training (CLIP) model is a recently proposed large-scale pre-trained model which has attracted increasing attention in the computer vision community. Benefiting from its gigantic image-text training set, the CLIP model has learned outstanding zero-shot learning and image-text matching capabilities. To boost its recognition performance on target visual concepts, it is often desirable to further update the CLIP model by fine-tuning it on extra training data for some classes of interest. This operation, however, raises an important concern: will the update hurt the zero-shot learning or image-text matching capability of CLIP, i.e., cause catastrophic forgetting? If so, can existing continual learning algorithms be adapted to alleviate this risk? To answer these questions, this work conducts a systematic study of the continual learning problem for the CLIP model. We construct evaluation protocols to measure the impact of fine-tuning updates and explore different ways to upgrade existing continual learning methods to mitigate the forgetting issue of the CLIP model. Our study reveals the particular challenges of CLIP continual learning and lays a foundation for further research. Moreover, we propose a new algorithm, dubbed Learning without Forgetting via Replayed Vocabulary (VR-LwF), which proves effective for alleviating the forgetting issue of the CLIP model.
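As a rough illustration of the idea suggested by the name VR-LwF, the sketch below shows one way a "replayed vocabulary" could be used for Learning-without-Forgetting-style distillation: words sampled from a vocabulary act as pseudo text inputs, and a KL term keeps the updated model's image-text matching distribution over those words close to that of the frozen, pre-fine-tuning CLIP. This is a minimal sketch under stated assumptions; all function names, the exact loss form, and the toy data are illustrative, not the authors' implementation.

```python
# Hypothetical sketch of vocabulary-replay distillation for CLIP
# fine-tuning. Assumptions: the teacher is the frozen original CLIP,
# the student is the model being fine-tuned, and both score images
# against the same set of replayed vocabulary words.
import torch
import torch.nn.functional as F


def vr_lwf_loss(img_feats, old_txt_feats, new_txt_feats, tau=0.07):
    """KL distillation over similarity distributions for replayed words.

    img_feats:     (B, D) image embeddings from the model being fine-tuned
    old_txt_feats: (V, D) embeddings of replayed vocabulary words from the
                   frozen, pre-fine-tuning text encoder (teacher)
    new_txt_feats: (V, D) embeddings of the same words from the updated
                   text encoder (student)
    """
    img_feats = F.normalize(img_feats, dim=-1)
    old_txt_feats = F.normalize(old_txt_feats, dim=-1)
    new_txt_feats = F.normalize(new_txt_feats, dim=-1)

    # Image-to-word matching logits under the old and new models.
    old_logits = img_feats @ old_txt_feats.t() / tau  # (B, V), teacher
    new_logits = img_feats @ new_txt_feats.t() / tau  # (B, V), student

    teacher = F.softmax(old_logits.detach(), dim=-1)  # no grad to teacher
    student = F.log_softmax(new_logits, dim=-1)
    return F.kl_div(student, teacher, reduction="batchmean")


# Toy usage with random features (B=4 images, V=100 words, D=512 dims).
if __name__ == "__main__":
    B, V, D = 4, 100, 512
    loss = vr_lwf_loss(torch.randn(B, D), torch.randn(V, D),
                       torch.randn(V, D, requires_grad=True))
    print(loss.item())
```

In practice this term would be added to the regular fine-tuning loss on the classes of interest, so the model adapts to new data while the replayed vocabulary anchors its original image-text matching behavior.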