Humans possess the capacity to reason about the future based on a sparse collection of visual cues acquired over time. To emulate this ability, we introduce a novel task called Anticipation Captioning, which generates a caption for an unseen oracle image from a sparse, temporally ordered set of images. To tackle this new task, we propose a model called A-CAP, which incorporates commonsense knowledge into a pre-trained vision-language model, allowing it to anticipate the caption. Through both qualitative and quantitative evaluations on a customized visual storytelling dataset, A-CAP outperforms other image captioning methods and establishes a strong baseline for anticipation captioning. We also address the challenges inherent in this task.