Our goal in this work is to train an image captioning model that generates more dense and informative captions. We introduce "relational captioning," a novel image captioning task which aims to generate multiple captions with respect to relational information between objects in an image. Relational captioning is a framework that is advantageous in both diversity and amount of information, leading to image understanding based on relationships. Part-of speech (POS, i.e. subject-object-predicate categories) tags can be assigned to every English word. We leverage the POS as a prior to guide the correct sequence of words in a caption. To this end, we propose a multi-task triple-stream network (MTTSNet) which consists of three recurrent units for the respective POS and jointly performs POS prediction and captioning. We demonstrate more diverse and richer representations generated by the proposed model against several baselines and competing methods.
翻译:我们在这项工作中的目标是训练一个图像字幕模型,该模型产生更加密集和内容更加丰富的字幕。我们引入了“关系字幕”,这是一项新颖的图像字幕任务,旨在生成图像中对象之间关联信息方面的多个标题。关系字幕是一个在信息的多样性和数量上都具有优势的框架,导致基于关系对图像的理解。可以给每一个英语单词指定部分语言标记(POS,即主题对象和对象类别)。我们利用 POS作为前台指导标题中正确顺序的文字。为此,我们提议建立一个多任务三流网络(MTTSNet),由不同的 POS 三个经常性单位组成,共同进行POS 预测和字幕。我们用多个基线和相互竞争的方法展示了拟议模型产生的更加多样和更加丰富的表达方式。