Recent years have witnessed the rapid development of large-scale pre-training frameworks that can extract multi-modal representations in a unified form and achieve promising performance when transferred to downstream tasks. Nevertheless, existing approaches mainly focus on pre-training with simple image-text pairs, while neglecting the semantic connections between concepts from different modalities. In this paper, we propose a knowledge-based pre-training framework, dubbed Knowledge-CLIP, which injects semantic information into the widely used CLIP model. By introducing knowledge-based objectives into the pre-training process and utilizing different types of knowledge graphs as training data, our model can semantically align vision and language representations with higher quality and enhance its reasoning ability across scenarios and modalities. Extensive experiments on various vision-language downstream tasks demonstrate the effectiveness of Knowledge-CLIP compared with the original CLIP and competitive baselines.