Multi-modal pre-training models have been intensively explored in recent years to bridge vision and language. However, most of them explicitly model the cross-modal interaction between image-text pairs, assuming that a strong semantic correlation exists between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project 'WenLan' led by our team. Specifically, under the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP, which adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the recent MoCo method to the cross-modal scenario. By building a large queue-based dictionary, BriVL can incorporate far more negative samples with limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training the BriVL model. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.
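To make the queue-based dictionary idea concrete, the following is a minimal sketch (assuming PyTorch) of how a MoCo-style momentum encoder and negative queue can be adapted to image-text pairs. This is not the authors' implementation: the encoder interfaces, embedding dimension, queue size, momentum, and temperature are illustrative assumptions, and only the image-to-text direction is shown.

```python
# Minimal sketch of the image-to-text direction of a MoCo-style
# cross-modal contrastive objective. NOT the authors' code: the
# encoder interfaces and all hyperparameters below are assumptions.
import copy

import torch
import torch.nn.functional as F


class CrossModalMoCo(torch.nn.Module):
    def __init__(self, image_encoder, text_encoder, dim=256,
                 queue_size=16384, momentum=0.99, temperature=0.07):
        super().__init__()
        self.img_enc = image_encoder                   # query encoder, trained by backprop
        self.txt_enc = text_encoder                    # online text encoder (source of momentum updates)
        self.txt_enc_m = copy.deepcopy(text_encoder)   # momentum key encoder, no gradients
        for p in self.txt_enc_m.parameters():
            p.requires_grad = False
        self.m, self.t = momentum, temperature
        # Queue of past text embeddings reused as extra negatives; this is
        # how many more negatives fit under a fixed GPU memory budget.
        self.register_buffer("queue",
                             F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        # Slowly track the online text encoder, as in MoCo.
        for p_k, p_q in zip(self.txt_enc_m.parameters(),
                            self.txt_enc.parameters()):
            p_k.mul_(self.m).add_(p_q, alpha=1.0 - self.m)

    @torch.no_grad()
    def _enqueue(self, keys):
        # Assumes queue_size is a multiple of the batch size.
        n, ptr = keys.shape[0], int(self.ptr)
        self.queue[ptr:ptr + n] = keys
        self.ptr[0] = (ptr + n) % self.queue.shape[0]

    def forward(self, images, texts):
        q = F.normalize(self.img_enc(images), dim=1)       # image queries (B, dim)
        with torch.no_grad():
            self._momentum_update()
            k = F.normalize(self.txt_enc_m(texts), dim=1)  # text keys (B, dim)
        l_pos = (q * k).sum(dim=1, keepdim=True)           # matched pairs (B, 1)
        l_neg = q @ self.queue.t()                         # queued negatives (B, K)
        logits = torch.cat([l_pos, l_neg], dim=1) / self.t
        labels = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)
        loss = F.cross_entropy(logits, labels)             # InfoNCE over 1 + K candidates
        self._enqueue(k)
        return loss
```

A symmetric text-to-image branch, with an image-side momentum encoder and its own queue, would complete the two-tower objective in the same fashion.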