Vision-language pre-training (VLP) on large-scale datasets has shown strong performance on various downstream tasks. A complete and fair benchmark (i.e., one that includes large-scale pre-training datasets and diverse downstream tasks) is essential for VLP. While there are plenty of benchmarks built on English corpora, building a rich VLP benchmark for other languages, such as Chinese, remains a critical open problem. To this end, we build a large-scale Chinese cross-modal benchmark called Zero so that the research community can fairly compare VLP models. We release two pre-training datasets and five fine-tuning datasets for downstream tasks. In addition, we propose a novel pre-training framework of pre-Ranking + Ranking for cross-modal learning. Specifically, we apply global contrastive pre-ranking to learn separate representations of images and texts. We then fuse the representations in a fine-grained ranking manner via an image-text cross encoder and a text-image cross encoder. To further enhance the capability of the model, we propose a two-way distillation strategy consisting of target-guided Distillation and feature-guided Distillation. For brevity, we name our model R2D2. We achieve state-of-the-art performance on four public cross-modal datasets and the five proposed downstream datasets. On zero-shot tasks on Flickr30k-CN, COCO-CN, and MUGE, R2D2 pre-trained on a 250-million-sample dataset improves mean recall by 4.7%, 5.4%, and 6.3%, respectively, over the previous state of the art. The datasets, models, and code are available at https://github.com/yuxie11/R2D2
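The global contrastive pre-ranking step is described here only at a high level. As a minimal sketch, assuming it resembles a standard symmetric in-batch contrastive (InfoNCE-style) objective over separately encoded image and text embeddings, the loss could look like the following. The function names, the temperature value of 0.07, and the random embeddings are illustrative assumptions and are not taken from the R2D2 code release.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale each vector to unit length so dot products are cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric in-batch contrastive loss (hypothetical helper).

    Row i of image_emb and row i of text_emb are assumed to be a matched
    pair; every other row in the batch serves as a negative.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature          # shape (batch, batch)
    idx = np.arange(logits.shape[0])
    # Image-to-text: each row should peak at its own column.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each column should peak at its own row.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))              # stand-in for encoder outputs
    print("aligned pairs :", contrastive_loss(emb, emb))
    print("shuffled pairs:", contrastive_loss(emb, np.roll(emb, 1, axis=0)))
```

In this sketch, correctly matched pairs yield a lower loss than mismatched ones, which is the signal that pulls paired image and text representations together before any cross-encoder ranking is applied.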