Image paragraph captioning aims to describe a given image with a sequence of coherent sentences. Most existing methods model coherence through topic transition, dynamically inferring a topic vector from the preceding sentences. However, these methods still suffer from immediate or delayed repetition in the generated paragraphs because (i) the entanglement of syntax and semantics distracts the topic vector from attending to pertinent visual regions, and (ii) there are few constraints or rewards for learning long-range transitions. In this paper, we propose a bypass network that separately models the semantics and the linguistic syntax of preceding sentences. Specifically, the proposed model consists of two main modules, i.e., a topic transition module and a sentence generation module. The former takes previous semantic vectors as queries and applies an attention mechanism to regional features to acquire the next topic vector, which reduces immediate repetition by excluding linguistic syntax. The latter decodes the topic vector together with the preceding syntax state to produce the following sentence. To further reduce delayed repetition in generated paragraphs, we devise a replacement-based reward for REINFORCE training. Comprehensive experiments on the widely used benchmark demonstrate the superiority of the proposed model over the state of the art in coherence while maintaining high accuracy.
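The topic transition step described above can be sketched as follows. This is a minimal, hypothetical illustration only: it assumes a scaled dot-product attention in which the semantic vector of the preceding sentences acts as the query over regional visual features, with illustrative shapes; it is not the paper's exact formulation.

```python
import numpy as np

def topic_transition(semantic_query, region_feats):
    """Hypothetical sketch of the topic transition module: the semantic
    vector of preceding sentences (free of syntactic information)
    attends over regional visual features to produce the next topic
    vector. The scaled dot-product form and shapes are assumptions."""
    d = region_feats.shape[-1]
    scores = region_feats @ semantic_query / np.sqrt(d)  # (N,) one score per region
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                             # softmax over regions
    topic = weights @ region_feats                       # (d,) attended topic vector
    return topic, weights

rng = np.random.default_rng(0)
regions = rng.normal(size=(36, 512))  # e.g. 36 detected regions, 512-d features
query = rng.normal(size=512)          # semantic state of preceding sentences
topic, attn = topic_transition(query, regions)
```

Because the query carries only semantic (not syntactic) content, the attention is not distracted by sentence-form regularities, which is the mechanism the bypass design relies on to curb immediate repetition.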