We introduce dense relational captioning, a novel image captioning task that aims to generate multiple captions with respect to the relational information between objects in a visual scene. Relational captioning provides explicit descriptions of each relationship between object combinations. This framework is advantageous in both the diversity and the amount of information it conveys, leading to comprehensive image understanding based on relationships, e.g., relational proposal generation. For relational understanding between objects, part-of-speech (POS; i.e., subject-object-predicate categories) can serve as valuable prior information to guide the causal sequence of words in a caption. We train our framework not only to generate captions but also to understand the POS of each word. To this end, we propose the multi-task triple-stream network (MTTSNet), which consists of three recurrent units, one responsible for each POS category, trained by jointly predicting the correct caption and the POS of each word. In addition, we find that the performance of MTTSNet can be further improved by modulating the object embeddings with an explicit relational module. Through extensive experimental analysis on large-scale datasets and several metrics, we demonstrate that our proposed model generates more diverse and richer captions. Finally, we present applications of our framework to holistic image captioning, scene graph generation, and retrieval tasks.
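To make the triple-stream idea concrete, the following is a minimal, hypothetical sketch (not the authors' released code): three recurrent streams, one per POS role (subject, predicate, object), whose concatenated hidden states jointly predict the next caption word and its POS tag, corresponding to the multi-task objective described above. All module and parameter names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TripleStreamSketch(nn.Module):
    """Hedged sketch of a multi-task triple-stream decoder (illustrative only)."""

    def __init__(self, vocab_size, pos_classes=3, embed_dim=64, hidden_dim=64):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # One recurrent unit per POS role: subject, predicate, object.
        self.streams = nn.ModuleList(
            nn.LSTMCell(embed_dim, hidden_dim) for _ in range(3)
        )
        # Two task heads over the concatenated stream states:
        self.word_head = nn.Linear(3 * hidden_dim, vocab_size)   # caption task
        self.pos_head = nn.Linear(3 * hidden_dim, pos_classes)   # POS task

    def forward(self, tokens):
        # tokens: (batch, seq) integer word indices.
        b, seq = tokens.shape
        states = [
            (torch.zeros(b, self.hidden_dim), torch.zeros(b, self.hidden_dim))
            for _ in range(3)
        ]
        word_logits, pos_logits = [], []
        for t in range(seq):
            x = self.embed(tokens[:, t])
            # Advance all three streams on the same input word.
            states = [cell(x, s) for cell, s in zip(self.streams, states)]
            h = torch.cat([s[0] for s in states], dim=1)
            word_logits.append(self.word_head(h))  # next-word prediction
            pos_logits.append(self.pos_head(h))    # POS prediction per word
        return torch.stack(word_logits, dim=1), torch.stack(pos_logits, dim=1)
```

Training would sum a cross-entropy loss over `word_logits` (caption supervision) and one over `pos_logits` (POS supervision), realizing the joint multi-task objective; the relational modulation of object embeddings mentioned above is omitted here for brevity.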