Although pre-trained language models (PLMs) have shown impressive performance through text-only self-supervised training, they are found to lack visual semantics or commonsense, e.g., the sizes, shapes, and colors of commonplace objects. Existing solutions often rely on explicit images for visual knowledge augmentation (requiring time-consuming retrieval or generation), and they also apply the augmentation to the whole input text, without considering whether it is actually needed for specific inputs or tasks. To address these issues, we propose VAWI, a novel visually-augmented fine-tuning approach that can be generally applied to various PLMs and NLP tasks without using any retrieved or generated images. Specifically, we first identify the visually-hungry words (VH-words) in the input text via a token selector, for which we propose three different strategies: syntax-, attention-, and learning-based. Then, we adopt a fixed CLIP text encoder to generate visually-augmented representations of these VH-words. Since CLIP has been pre-trained with a vision-language alignment task on a large-scale corpus, it is capable of injecting visual semantics into the aligned text representations. Finally, the visually-augmented features are fused and transformed into pre-designed visual prompts based on the VH-words, which can be inserted into PLMs to enrich the visual semantics of word representations. We conduct extensive experiments on ten NLP tasks, i.e., the GLUE benchmark, CommonsenseQA, CommonGen, and SNLI-VE. Experimental results show that our approach can consistently improve the performance of BERT, RoBERTa, BART, and T5 at different scales, and significantly outperform several competitive baselines. Our codes and data are publicly available at~\url{https://github.com/RUCAIBox/VAWI}.
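To make the pipeline concrete, a minimal Python sketch of the three steps (VH-word selection, a frozen CLIP text encoder, and prompt fusion) is given below. It assumes HuggingFace \texttt{transformers}; the \texttt{select\_vh\_words} heuristic, the fusion layer, and the prompt-insertion scheme are simplified hypothetical placeholders, not the released implementation (see the repository URL above for that).

\begin{verbatim}
# Minimal sketch of a VAWI-style pipeline (assumptions noted in comments).
import torch
from transformers import AutoModel, AutoTokenizer, CLIPTextModel, CLIPTokenizer

plm_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
plm = AutoModel.from_pretrained("bert-base-uncased")
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip_txt = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
clip_txt.requires_grad_(False)  # the CLIP text encoder stays frozen

STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and"}

def select_vh_words(text):
    # Hypothetical stand-in for the paper's syntax-/attention-/learning-based
    # VH-word selectors: simply keep non-stopword tokens.
    return [w for w in text.lower().split() if w not in STOPWORDS] or [text]

# Simple trainable fusion layer mapping CLIP features to the PLM's hidden size.
fusion = torch.nn.Linear(clip_txt.config.hidden_size, plm.config.hidden_size)

def visual_prompts(text):
    vh_words = select_vh_words(text)
    enc = clip_tok(vh_words, padding=True, return_tensors="pt")
    with torch.no_grad():
        vh_repr = clip_txt(**enc).pooler_output   # (num_vh_words, clip_dim)
    return fusion(vh_repr)                        # (num_vh_words, plm_dim)

def forward_with_prompts(text):
    prompts = visual_prompts(text).unsqueeze(0)   # (1, num_vh_words, plm_dim)
    inputs = plm_tok(text, return_tensors="pt")
    word_emb = plm.embeddings.word_embeddings(inputs["input_ids"])
    # Prepend the visual prompts to the input embeddings (simplified insertion).
    inputs_embeds = torch.cat([prompts, word_emb], dim=1)
    attn = torch.ones(inputs_embeds.shape[:2], dtype=torch.long)
    return plm(inputs_embeds=inputs_embeds, attention_mask=attn).last_hidden_state

hidden = forward_with_prompts("The ripe banana turned bright yellow.")
print(hidden.shape)
\end{verbatim}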