Vision-Language Pre-training (VLP) models have shown remarkable performance on various downstream tasks. Their success heavily relies on the scale of the cross-modal datasets used for pre-training. However, the lack of large-scale datasets and benchmarks in Chinese hinders the development of Chinese VLP models and broader multilingual applications. In this work, we release a large-scale Chinese cross-modal dataset named Wukong, which contains 100 million Chinese image-text pairs collected from the web. Wukong aims to benchmark different multi-modal pre-training methods to facilitate VLP research and community development. Furthermore, we release a group of models pre-trained with various image encoders (ViT-B/ViT-L/SwinT) and apply advanced pre-training techniques to VLP, such as locked-image text tuning, token-wise similarity in contrastive learning, and reduced-token interaction. Extensive experiments and benchmarks on different downstream tasks, including the largest human-verified image-text test set to date, are also provided. Experiments show that Wukong can serve as a promising Chinese pre-training dataset and benchmark for different cross-modal learning methods. On the zero-shot image classification task over 10 datasets, $Wukong_{ViT-L}$ achieves an average accuracy of 73.03%. On the image-text retrieval task, it achieves a mean recall of 71.6% on AIC-ICC, which is 12.9% higher than that of WenLan 2.0. Our Wukong models are also benchmarked against other variants on multiple downstream datasets, e.g., Flickr8K-CN, Flickr30K-CN, and COCO-CN. More information is available at https://wukong-dataset.github.io/wukong-dataset/.
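To make the "token-wise similarity in contrastive learning" mentioned above concrete, the following PyTorch sketch computes an image-to-text similarity matrix via late interaction over token embeddings (in the spirit of FILIP-style token-wise matching). It is a minimal illustration under assumed tensor shapes and function names, not the released implementation.

```python
import torch
import torch.nn.functional as F

def token_wise_similarity(image_tokens, text_tokens, text_mask):
    """Token-wise (late-interaction) similarity sketch.

    image_tokens: (B, Ni, D) patch-token embeddings from the image encoder
    text_tokens:  (B, Nt, D) token embeddings from the text encoder
    text_mask:    (B, Nt) bool mask, True for real (non-padding) text tokens
    Returns a (B, B) image-to-text similarity matrix.
    """
    # Normalize so that dot products are cosine similarities.
    img = F.normalize(image_tokens, dim=-1)
    txt = F.normalize(text_tokens, dim=-1)
    # All-pairs token similarities between every image and every text: (B, B, Ni, Nt).
    sim = torch.einsum("aid,bjd->abij", img, txt)
    # Exclude padded text tokens from the max below.
    sim = sim.masked_fill(~text_mask[None, :, None, :], float("-inf"))
    # For each image token, take its best-matching text token,
    # then average these maxima over all image tokens.
    return sim.max(dim=-1).values.mean(dim=-1)  # (B, B) logits
```

The resulting matrix can be divided by a learnable temperature and fed to a standard InfoNCE objective, e.g. `F.cross_entropy(logits / tau, torch.arange(B))`; a symmetric text-to-image score is obtained by taking the max over image tokens instead. Restricting the interaction to a subset of tokens, as in the reduced-token interaction named above, lowers the cost of the (B, B, Ni, Nt) similarity tensor.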