Unpaired Image Captioning (UIC) aims to learn image descriptions from unaligned vision and language data. Existing works usually tackle this task with adversarial learning or a visual-concept reward optimized via reinforcement learning. However, these methods capture only limited cross-domain information between the vision and language domains, which restricts the captioning performance of UIC. Inspired by the success of Vision-Language Pre-Trained Models (VL-PTMs) in this research field, we attempt to infer cross-domain cues about a given image from large VL-PTMs for the UIC task. This work is also motivated by recent successes of prompt learning in many downstream multi-modal tasks, including image-text retrieval and visual question answering. Specifically, a semantic prompt is introduced and aggregated with visual features for more accurate caption prediction under the adversarial learning framework. In addition, a metric prompt is designed to select high-quality pseudo image-caption pairs generated by the basic captioning model and to refine the model in an iterative manner. Extensive experiments on the COCO and Flickr30K datasets validate the promising captioning ability of the proposed model. We expect that the proposed prompt-based UIC model will stimulate a new line of research on VL-PTM-based captioning.
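To make the metric-prompt idea concrete, the sketch below shows one possible way to score and filter pseudo image-caption pairs with a frozen VL-PTM before the next round of self-training. It is only an illustrative sketch, not the authors' implementation: it assumes CLIP as the VL-PTM, and the helper name `filter_pseudo_pairs` and the similarity threshold are hypothetical choices.

```python
# Illustrative sketch (assumption-based, not the paper's code): use a frozen
# CLIP model to rate pseudo image-caption pairs and keep only high-scoring
# ones for iterative refinement of the captioning model.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def filter_pseudo_pairs(pairs, threshold=0.25):
    """pairs: list of (image_path, pseudo_caption) tuples.
    Returns the subset whose image-text similarity exceeds the threshold."""
    kept = []
    for image_path, caption in pairs:
        image = Image.open(image_path).convert("RGB")
        inputs = processor(text=[caption], images=image,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            outputs = model(**inputs)
        # Cosine similarity between the image and text embeddings.
        score = torch.nn.functional.cosine_similarity(
            outputs.image_embeds, outputs.text_embeds
        ).item()
        if score >= threshold:
            kept.append((image_path, caption, score))
    return kept
```

The surviving pairs would then be fed back as supervised data for the next training iteration, mirroring the select-then-refine loop described above; the actual metric prompt in the paper may use a different scoring scheme.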