The tremendous success of CLIP (Radford et al., 2021) has promoted the research and application of contrastive learning for vision-language pretraining. In this work, we construct a large-scale dataset of Chinese image-text pairs, where most of the data are retrieved from publicly available datasets, and we pretrain Chinese CLIP models on the new dataset. We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters. Furthermore, we propose a two-stage pretraining method, in which the model is first trained with the image encoder frozen and then trained with all parameters optimized, to achieve enhanced model performance. Our comprehensive experiments demonstrate that Chinese CLIP achieves state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN in both the zero-shot and finetuning setups, and that it achieves competitive performance in zero-shot image classification on the ELEVATER benchmark (Li et al., 2022). We have released our code, models, and demos at https://github.com/OFA-Sys/Chinese-CLIP
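
As a rough illustration of the two-stage pretraining scheme summarized above, the sketch below first trains with the image encoder frozen and then with all parameters optimized. It is a minimal PyTorch-style sketch: the ToyCLIP model, loss, batch sizes, and hyperparameters are illustrative assumptions, not the released Chinese-CLIP training code.

```python
# Sketch of two-stage pretraining: stage 1 freezes the image encoder and only
# tunes the text encoder; stage 2 unfreezes everything. The model and loss are
# illustrative stand-ins, not the actual Chinese-CLIP implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyCLIP(nn.Module):
    """Stand-in for a CLIP-style dual encoder (hypothetical shapes)."""

    def __init__(self, dim=64):
        super().__init__()
        self.image_encoder = nn.Linear(128, dim)  # placeholder for a ViT image tower
        self.text_encoder = nn.Linear(32, dim)    # placeholder for a Chinese text tower
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ln(1/0.07), as in CLIP

    def forward(self, images, texts):
        img = F.normalize(self.image_encoder(images), dim=-1)
        txt = F.normalize(self.text_encoder(texts), dim=-1)
        return self.logit_scale.exp() * img @ txt.t()


def contrastive_loss(logits):
    # Symmetric InfoNCE loss; matched image-text pairs lie on the diagonal.
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2


def set_image_encoder_frozen(model, frozen):
    for p in model.image_encoder.parameters():
        p.requires_grad = not frozen


def train_stage(model, frozen_image_encoder, steps=100):
    set_image_encoder_frozen(model, frozen_image_encoder)
    optim = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=1e-4
    )
    for _ in range(steps):
        images, texts = torch.randn(8, 128), torch.randn(8, 32)  # dummy batch
        loss = contrastive_loss(model(images, texts))
        optim.zero_grad()
        loss.backward()
        optim.step()


model = ToyCLIP()
train_stage(model, frozen_image_encoder=True)   # stage 1: image encoder frozen
train_stage(model, frozen_image_encoder=False)  # stage 2: all parameters optimized
```

In practice the staging simply changes which parameters receive gradients; the contrastive objective is the same in both stages.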