Recently, tuning pre-trained language models (PLMs) in a parameter-efficient manner has become a popular topic in natural language processing. However, most existing methods focus on tuning the PLM with text-only information. In this work, we propose a new perspective: tuning a frozen PLM with images for caption generation. We denote our method I-Tuning; it automatically filters visual information from images to adjust the output hidden states of the PLM. Evaluated on image captioning benchmarks (MSCOCO and Flickr30k Captioning), our method achieves comparable or even better performance than previous models that have 2-4 times more trainable parameters and/or consume a large amount of cross-modal pre-training data.
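To make the adjustment idea concrete, below is a minimal sketch of a lightweight cross-modal adapter that cross-attends from frozen PLM hidden states to frozen image features and adds the result back to the hidden states. The module name, dimensions, and choice of a single cross-attention layer are illustrative assumptions, not the paper's exact I-Tuning design.

```python
import torch
import torch.nn as nn

class ITuningSketch(nn.Module):
    """Hypothetical cross-modal adapter sketch (not the paper's exact module).

    Queries come from the frozen PLM hidden states, keys/values from frozen
    image features; only the adapter parameters would be trained.
    """

    def __init__(self, d_model=768, d_vision=768, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_vision, d_model)  # map image features to PLM width
        self.norm = nn.LayerNorm(d_model)

    def forward(self, plm_hidden, image_feats):
        # plm_hidden:  (batch, seq_len, d_model) from the frozen language model
        # image_feats: (batch, n_patches, d_vision) from a frozen image encoder
        vis = self.proj(image_feats)
        adjust, _ = self.attn(query=self.norm(plm_hidden), key=vis, value=vis)
        return plm_hidden + adjust  # adjusted hidden states passed onward
```

In a setup like this, both the PLM and the image encoder stay frozen, so the trainable parameter count is limited to the small adapter, which is what makes the approach parameter-efficient.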