Generative language models (LMs) such as GPT-2/3 can be prompted to generate text with remarkable quality. While they are designed for text-prompted generation, it remains an open question how the generation process could be guided by modalities beyond text, such as images. In this work, we propose a training-free framework, called MAGIC (iMAge-Guided text generatIon with CLIP), for plugging visual controls into the generation process and enabling LMs to perform multimodal tasks (e.g., image captioning) in a zero-shot manner. MAGIC is a simple yet efficient plug-and-play framework, which directly combines an off-the-shelf LM (i.e., GPT-2) and an image-text matching model (i.e., CLIP) for image-grounded text generation. During decoding, MAGIC influences the generation of the LM by introducing a CLIP-induced score, called the magic score, which regularizes the generated result to be semantically related to a given image while remaining coherent with the previously generated context. Notably, the proposed decoding scheme does not involve any gradient update operation and is therefore computationally efficient. On the challenging task of zero-shot image captioning, MAGIC outperforms the state-of-the-art method by notable margins with a nearly 27-times decoding speedup. MAGIC is a flexible framework and is theoretically compatible with any text generation task that incorporates image grounding. In the experiments, we showcase that it is also capable of performing visually grounded story generation given both an image and a text prompt.
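The decoding rule described above can be illustrated with a minimal sketch. This is a simplified, hypothetical rendering of MAGIC's candidate selection, not the paper's exact formula: it combines the LM's confidence in each top-k candidate token with a CLIP-induced magic score (here, the CLIP image-text similarities softmax-normalized over the candidate set, weighted by a hypothetical coefficient `beta`), and omits details such as the degeneration penalty. The token names, probabilities, and similarity values in the example are invented for illustration.

```python
import math

def magic_select(candidates, lm_probs, clip_sims, beta=2.0):
    """Pick the next token among top-k candidates by combining the LM's
    confidence with a CLIP-induced image-relevance score.
    Simplified sketch of MAGIC-style decoding; values are illustrative."""
    # Softmax-normalize the CLIP image-text similarities over the candidate set
    # to obtain the magic score for each candidate.
    exps = [math.exp(s) for s in clip_sims]
    total = sum(exps)
    magic_scores = [e / total for e in exps]
    # Combine model confidence with the weighted magic score and take the argmax.
    combined = [p + beta * m for p, m in zip(lm_probs, magic_scores)]
    best = max(range(len(candidates)), key=lambda i: combined[i])
    return candidates[best]

# Toy example: three candidate continuations for a caption of a dog photo.
tokens = ["dog", "the", "car"]
lm_probs = [0.30, 0.45, 0.25]   # hypothetical LM probabilities
clip_sims = [0.90, 0.20, 0.10]  # hypothetical CLIP similarities to the image
print(magic_select(tokens, lm_probs, clip_sims))  # the image-relevant "dog" wins
```

Because the magic score only re-ranks candidates at each step, no gradient computation or model fine-tuning is involved, which is the source of the decoding speedup reported above.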