Image caption generation is one of the most challenging problems at the intersection of visual recognition and natural language modeling. In this work, we propose and study a practically important variant of this problem in which test images may contain visual objects with no corresponding visual or textual training examples. For this problem, we propose a detection-driven approach based on a generalized zero-shot detection model and a template-based sentence generation model. To improve the detection component, we jointly define a class-to-class similarity based class representation and a practical score calibration mechanism. We also propose a novel evaluation metric that provides complementary insights into the captioning outputs by separately handling the visual and non-visual components of the captions. Our experiments show that the proposed zero-shot detection model achieves state-of-the-art performance on the MS-COCO dataset and that the zero-shot captioning approach yields promising results.