Image captioning is a challenging task in which a machine automatically describes an image with sentences or phrases. Training a captioning model usually requires a large number of paired image-sentence annotations. However, a pre-trained captioning model can hardly be applied to a new domain that contains novel object categories, i.e., objects whose description words were unseen during model training. Captioning a novel object correctly requires professional human workers to annotate images with sentences containing the novel words. This annotation is labor-intensive and expensive, which limits the use of captioning models in real-world applications. In this paper, we introduce the zero-shot novel object captioning task, in which the machine generates descriptions of novel objects without any additional training sentences about them. To tackle this challenging problem, we propose a Decoupled Novel Object Captioner (DNOC) framework that can fully decouple the language sequence model from the object descriptions. DNOC has two components. 1) A Sequence Model with the Placeholder (SM-P) generates a sentence containing placeholders. Each placeholder represents an unseen novel object, so the sequence model can be decoupled from the novel object descriptions. 2) A key-value object memory, built upon a freely available detection model, contains the visual information and the corresponding word for each object. The SM-P generates a query to retrieve words from the object memory. The placeholder is then filled with the correct word, resulting in a caption that describes the novel object. Experimental results on the held-out MSCOCO dataset demonstrate the ability of DNOC to describe novel concepts in the zero-shot novel object captioning task.
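The interaction between the two components can be illustrated with a small sketch. It is only a toy illustration under stated assumptions, not the DNOC implementation: the placeholder token name `<PL>`, the helper functions, the three-dimensional feature vectors, and the use of cosine similarity for retrieval are all hypothetical choices made here for clarity. In the framework described above, the keys would come from a detection model and the queries from the SM-P.

```python
from math import sqrt

PLACEHOLDER = "<PL>"  # hypothetical token name for a novel-object slot


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, memory):
    """Return the word whose key is most similar to the query.

    `memory` is a list of (key_vector, word) pairs, standing in for the
    key-value object memory (visual feature -> detected class word).
    Cosine similarity is an assumption made for this sketch.
    """
    _, best_word = max(memory, key=lambda kv: cosine(query, kv[0]))
    return best_word


def fill_placeholders(tokens, queries, memory):
    """Replace each placeholder with a word retrieved by its query."""
    query_iter = iter(queries)
    return [
        retrieve(next(query_iter), memory) if tok == PLACEHOLDER else tok
        for tok in tokens
    ]


# Toy memory: made-up features and words for two detected objects.
memory = [([0.9, 0.1, 0.0], "zebra"), ([0.1, 0.8, 0.2], "bus")]

# Toy sequence-model output: a sentence with one placeholder, plus one query.
tokens = ["a", PLACEHOLDER, "standing", "in", "a", "field"]
queries = [[1.0, 0.0, 0.1]]

caption = " ".join(fill_placeholders(tokens, queries, memory))
print(caption)
```

Because the sentence structure and the object word are produced separately, adding a new object category in this sketch only requires adding a new `(key, word)` pair to `memory`; the token sequence itself never needs to contain the novel word.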