Novel object captioning (NOC) aims to describe images containing objects whose ground truth captions are never observed during training. Due to the absence of caption annotations, captioning models cannot be directly optimized via sequence-to-sequence training or CIDEr optimization. To this end, we present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC that heuristically optimizes output captions via paraphrasing. With P2C, the captioning model first learns paraphrasing from a language model pre-trained on a text-only corpus, allowing it to expand its word bank and improve linguistic fluency. To further enforce that output captions sufficiently describe the visual content of the input image, we introduce fidelity and adequacy objectives under which the captioning model performs self-paraphrasing. Since no ground truth captions of novel object images are available during training, our P2C leverages cross-modality (image-text) association modules to ensure that the above caption characteristics are properly preserved. In the experiments, we not only show that P2C achieves state-of-the-art performance on the nocaps and COCO Caption datasets, but also verify the effectiveness and flexibility of our learning framework by replacing the language and cross-modality association models for NOC. Implementation details and code are available in the supplementary materials.
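To make the roles of the language model and the cross-modality association module concrete, the following is a minimal, hypothetical sketch, not the authors' implementation: CLIP stands in for the image-text association module scoring fidelity/adequacy, GPT-2 perplexity stands in for the language-model fluency signal, and the helper prefer_paraphrase accepts a paraphrased caption only when neither score degrades.

```python
# Hypothetical sketch of scoring self-paraphrased captions. CLIP and GPT-2
# are illustrative stand-ins for the association and language models; the
# paper's actual modules and objectives may differ.
import torch
from PIL import Image
from transformers import (CLIPModel, CLIPProcessor,
                          GPT2LMHeadModel, GPT2TokenizerFast)

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
lm = GPT2LMHeadModel.from_pretrained("gpt2")
lm_tok = GPT2TokenizerFast.from_pretrained("gpt2")

@torch.no_grad()
def fidelity_score(image: Image.Image, caption: str) -> float:
    """Image-text association score: how well the caption matches the image."""
    inputs = clip_proc(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    return clip(**inputs).logits_per_image.item()

@torch.no_grad()
def fluency_score(caption: str) -> float:
    """Negative language-model loss: higher means more fluent text."""
    ids = lm_tok(caption, return_tensors="pt").input_ids
    return -lm(input_ids=ids, labels=ids).loss.item()

def prefer_paraphrase(image: Image.Image, original: str, paraphrase: str) -> str:
    """Keep a paraphrase only if it loses neither visual fidelity nor
    fluency -- a toy version of the fidelity/adequacy constraints."""
    if (fidelity_score(image, paraphrase) >= fidelity_score(image, original)
            and fluency_score(paraphrase) >= fluency_score(original)):
        return paraphrase
    return original
```

In this toy setup, the association score guards adequacy (the paraphrase must still describe the image content) while the language-model score guards fluency, mirroring the two caption characteristics the framework aims to preserve.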