Image captioning is a classic vision-and-language task that aims to generate a natural-language description of an image. Recent studies focus on scaling up model size and the amount of training data, which significantly increases the cost of model training. In contrast to these costly models, we introduce a lightweight image captioning framework (I-Tuning) that contains only a small number of trainable parameters. We design a novel I-Tuning cross-attention module that connects a frozen pre-trained language decoder (GPT2) with a frozen vision encoder (CLIP-ViT). Since most parameters do not need to be updated during training, our framework is lightweight and fast. Experimental results on three image captioning benchmarks show that our framework achieves performance comparable to or better than large-scale baseline systems, while using up to 10 times fewer trainable parameters and requiring far less training data than state-of-the-art baselines.
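To make the core idea concrete, the sketch below is a minimal, illustrative PyTorch example (not the authors' released implementation): a small trainable cross-attention adapter lets decoder hidden states attend to image features, while the pre-trained encoder and decoder stay frozen. The class name `ITuningCrossAttention`, the placeholder encoder/decoder modules, and all dimensions are assumptions made for illustration.

```python
# Minimal sketch of a trainable cross-attention bridge between a frozen vision
# encoder and a frozen language decoder. Names and dimensions are hypothetical.
import torch
import torch.nn as nn

class ITuningCrossAttention(nn.Module):
    """Trainable adapter: decoder hidden states attend to frozen image features."""
    def __init__(self, d_text: int, d_vision: int, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(d_vision, d_text)           # map vision features to decoder width
        self.attn = nn.MultiheadAttention(d_text, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_text)

    def forward(self, text_hidden, image_feats):
        kv = self.proj(image_feats)                        # (B, num_patches, d_text)
        attended, _ = self.attn(text_hidden, kv, kv)       # queries come from the decoder
        return self.norm(text_hidden + attended)           # residual injection of visual info

# Stand-ins for the frozen pre-trained models (e.g. CLIP-ViT and GPT2 blocks).
vision_encoder = nn.Linear(2048, 768)    # placeholder, not the real CLIP-ViT
language_decoder = nn.Linear(768, 768)   # placeholder, not the real GPT2
for p in list(vision_encoder.parameters()) + list(language_decoder.parameters()):
    p.requires_grad = False               # pre-trained weights stay fixed

adapter = ITuningCrossAttention(d_text=768, d_vision=768)  # only these weights train

image_feats = vision_encoder(torch.randn(2, 49, 2048))     # (B, patches, d_vision)
text_hidden = language_decoder(torch.randn(2, 20, 768))    # (B, tokens, d_text)
fused = adapter(text_hidden, image_feats)                   # visually grounded hidden states
print(fused.shape)  # torch.Size([2, 20, 768])
```

Because only the adapter's parameters receive gradients, an optimizer built over `adapter.parameters()` alone reflects the small trainable-parameter budget described above.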