Vision-and-language pre-trained models (VLMs) have achieved tremendous success in the cross-modal area, but most of them require a large amount of parallel image-caption data for pre-training, and collecting such data is expensive and labor-intensive. In this work, we focus on reducing this need for generative vision-and-language pre-training (G-VLP) by taking advantage of a visual pre-trained model (CLIP-ViT) as the encoder and a language pre-trained model (GPT2) as the decoder. Unfortunately, GPT2 lacks the cross-attention module needed to attend to visual features, which hinders the direct connection of CLIP-ViT and GPT2. To remedy this defect, we conduct extensive experiments to empirically investigate how to design and pre-train our model. Based on the experimental results, we propose a novel G-VLP framework, Visual Conditioned GPT (VC-GPT), and pre-train it on a small-scale image-caption corpus (Visual Genome, only 110k distinct images). When evaluated on downstream image captioning tasks (MSCOCO and Flickr30k Captioning), VC-GPT achieves either the best or the second-best performance across all evaluation metrics, compared with previous works that consume around 30 times more distinct images during cross-modal pre-training.
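Because the abstract hinges on GPT2 shipping without cross-attention, the sketch below (not the paper's released code or the exact VC-GPT design) illustrates one common way such an encoder-decoder connection is wired up with HuggingFace transformers: instantiating GPT2 with add_cross_attention=True inserts randomly initialized cross-attention layers that can read the CLIP-ViT visual tokens. The checkpoint names and the dummy image tensor are illustrative assumptions.

```python
import torch
from transformers import CLIPVisionModel, GPT2LMHeadModel, GPT2TokenizerFast

# Visual encoder: CLIP-ViT (base variant assumed here), hidden size 768.
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch16")

# Language decoder: GPT2 with extra cross-attention blocks. These blocks are
# not part of the pre-trained GPT2 checkpoint, so they start randomly
# initialized -- exactly the gap the abstract points out.
decoder = GPT2LMHeadModel.from_pretrained("gpt2", add_cross_attention=True)
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# Dummy image batch; real usage would preprocess images with CLIP's
# image processor to this (1, 3, 224, 224) shape.
pixel_values = torch.randn(1, 3, 224, 224)
visual_tokens = encoder(pixel_values=pixel_values).last_hidden_state  # (1, 197, 768)

# The decoder attends to the visual tokens through its new cross-attention
# layers while generating the caption prefix autoregressively.
prefix = tokenizer("a photo of", return_tensors="pt")
outputs = decoder(
    input_ids=prefix.input_ids,
    encoder_hidden_states=visual_tokens,
)
print(outputs.logits.shape)  # (1, prefix_length, vocab_size)
```

In this generic setup the hidden sizes of CLIP-ViT-base and GPT2-base both happen to be 768, so no projection layer is needed; the newly added cross-attention parameters would then be trained on image-caption pairs.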