Caption Generation on Scenes with Seen and Unseen Object Categories

Image caption generation is one of the most challenging problems at the intersection of vision and language domains. In this work, we propose a realistic captioning task where the input scenes may incorporate visual objects with no corresponding visual or textual training examples. For this problem, we propose a detection-driven approach that consists of a single-stage generalized zero-shot detection model to recognize and localize instances of both seen and unseen classes, and a template-based captioning model that transforms detections into sentences. To improve the generalized zero-shot detection model, which provides essential information for captioning, we define effective class representations in terms of class-to-class semantic similarities, and leverage their special structure to construct an effective unseen/seen class confidence score calibration mechanism. We also propose a novel evaluation metric that provides additional insights for the captioning outputs by separately measuring the visual and non-visual contents of generated sentences. Our experiments highlight the importance of studying captioning in the proposed zero-shot setting, and verify the effectiveness of the proposed detection-driven zero-shot captioning approach.
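The abstract describes three ingredients: class representations built from class-to-class semantic similarities, a calibration of unseen versus seen class confidence scores, and a template that turns detections into a sentence. The sketch below is a toy illustration of how such pieces could fit together, not the authors' model. The embeddings in `EMBED`, the class split, the multiplicative `unseen_scale` factor, and the caption template are all assumptions made up for this example; the abstract does not specify any of them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical toy word embeddings (assumption, for illustration only).
EMBED = {
    "dog": [1.0, 0.0, 0.0],
    "car": [0.0, 1.0, 0.0],
    "cat": [0.9, 0.1, 0.0],  # treated as an unseen class below
}
SEEN = ["dog", "car"]
UNSEEN = ["cat"]

def class_representation(name):
    """Represent any class by its semantic similarity to each seen class."""
    return [cosine(EMBED[name], EMBED[s]) for s in SEEN]

def score_classes(seen_scores, unseen_scale=1.0):
    """Score every class for one detected box.

    seen_scores: detector responses over the seen classes (assumed given).
    unseen_scale: assumed multiplicative calibration applied to unseen-class
    scores; a generic stand-in, not the mechanism proposed in the paper.
    """
    scores = {}
    for name in SEEN + UNSEEN:
        score = cosine(seen_scores, class_representation(name))
        if name in UNSEEN:
            score *= unseen_scale
        scores[name] = score
    return scores

def caption(labels):
    """Fill a fixed sentence template with detected class names."""
    if not labels:
        return "A photo."
    parts = ["a " + label for label in labels]
    if len(parts) == 1:
        body = parts[0]
    else:
        body = ", ".join(parts[:-1]) + " and " + parts[-1]
    return "A photo of " + body + "."

if __name__ == "__main__":
    box_scores = [0.9, 0.1]  # hypothetical responses for (dog, car)
    for scale in (1.0, 0.9):
        scores = score_classes(box_scores, unseen_scale=scale)
        best = max(scores, key=scores.get)
        print(scale, best, caption([best]))
```

In this toy setup a box that responds mostly to "dog" is labeled "cat" when unseen scores are left uncalibrated, and "dog" once they are scaled down to 0.9, which shows why the balance between seen and unseen scores matters for the caption that follows.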

