Image caption generation is one of the most challenging problems at the intersection of vision and language domains. In this work, we propose a realistic captioning task where the input scenes may incorporate visual objects with no corresponding visual or textual training examples. For this problem, we propose a detection-driven approach that consists of a single-stage generalized zero-shot detection model to recognize and localize instances of both seen and unseen classes, and a template-based captioning model that transforms detections into sentences. To improve the generalized zero-shot detection model, which provides essential information for captioning, we define effective class representations in terms of class-to-class semantic similarities, and leverage their special structure to construct an effective unseen/seen class confidence score calibration mechanism. We also propose a novel evaluation metric that provides additional insights for the captioning outputs by separately measuring the visual and non-visual contents of generated sentences. Our experiments highlight the importance of studying captioning in the proposed zero-shot setting, and verify the effectiveness of the proposed detection-driven zero-shot captioning approach.
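To make the two mechanisms named above concrete, the following is a minimal, hypothetical sketch (not the paper's actual implementation) of similarity-based class representations and a seen/unseen confidence calibration. The names `seen_emb`, `unseen_emb`, and the uniform scaling factor `gamma` are illustrative assumptions; in particular, the paper's calibration exploits the special structure of the similarity-based representations, whereas the uniform scaling here is only a simple stand-in.

```python
import numpy as np

# Hypothetical class-name embeddings (e.g., word vectors);
# shapes: (num_seen, d) and (num_unseen, d).
seen_emb = np.random.randn(10, 300)
unseen_emb = np.random.randn(3, 300)

def cosine_sim(a, b):
    """Pairwise cosine similarity between the rows of a and b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Class representations defined by class-to-class semantic similarities:
# each class (seen or unseen) is described by its similarity to the seen classes.
seen_repr = cosine_sim(seen_emb, seen_emb)      # (num_seen, num_seen)
unseen_repr = cosine_sim(unseen_emb, seen_emb)  # (num_unseen, num_seen)

def calibrated_scores(scores_seen, scores_unseen, gamma=0.5):
    """Illustrative calibration: down-weight seen-class confidences by
    gamma so unseen classes can compete in the joint label space.
    (A stand-in for the paper's structure-aware mechanism.)"""
    return np.concatenate([scores_seen * gamma, scores_unseen])
```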