Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate strong zero-shot transfer capability in many discriminative tasks, and their adaptation to zero-shot image-conditioned text generation has drawn increasing interest. Prior works approach zero-shot captioning either by utilizing existing large language models (e.g., GPT-2) or by pre-training an encoder-decoder network end-to-end. In this work, we propose a simple framework, named DeCap, for zero-shot captioning. We introduce a lightweight visual-aware language decoder that is both data-efficient and computation-efficient: 1) it requires only text data for training, easing the burden of collecting paired data; 2) it does not require end-to-end training. When trained with text-only data, the decoder takes the text embedding extracted from the off-the-shelf CLIP encoder as a prefix embedding. The challenge is that the decoder is trained on a text corpus, but at inference it must generate captions from visual inputs. The modality gap widely observed in multi-modal contrastive models prevents us from directly using the visual embedding as the prefix embedding. We therefore propose a training-free mechanism to reduce the modality gap: we project the visual embedding into the CLIP text embedding space so that the projected embedding retains the information of the visual input. Taking the projected embedding as the prefix embedding, the decoder generates high-quality descriptions that match the visual input. Experiments show that DeCap outperforms other zero-shot captioning methods and unpaired captioning methods on typical image captioning benchmarks, i.e., MSCOCO and NoCaps.
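To make the training-free projection concrete, the sketch below shows one plausible way to map a CLIP image embedding into the text embedding space: represent it as a softmax-weighted convex combination of text embeddings drawn from a memory bank of training captions. All names, shapes, and the temperature value here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length, as CLIP embeddings typically are."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def project_to_text_space(image_emb, text_memory, tau=1.0 / 60.0):
    """Hypothetical training-free projection (a sketch, not the official DeCap code).

    image_emb:   (d,)   CLIP image embedding of the query image.
    text_memory: (N, d) CLIP text embeddings of a corpus of captions.
    tau:         softmax temperature (illustrative value).

    The image embedding is re-expressed as a similarity-weighted sum of
    text embeddings, so the result lies in the span of the text space
    (reducing the modality gap) while the weights carry image-specific
    information.
    """
    image_emb = l2_normalize(image_emb)
    text_memory = l2_normalize(text_memory)
    sims = text_memory @ image_emb            # (N,) cosine similarities
    weights = np.exp(sims / tau)
    weights /= weights.sum()                  # softmax over the memory
    projected = weights @ text_memory         # convex combination of text embeddings
    return l2_normalize(projected)
```

The projected vector can then be fed to the decoder as a prefix embedding in place of a genuine text embedding; a low temperature concentrates the weights on the captions most similar to the image.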