Understanding the interactions between objects in an image is an important element of caption generation. In this paper, we propose a relationship-based neural baby talk (R-NBT) model that comprehensively investigates several types of pairwise object interactions by encoding each image via three different relationship-based graph attention networks (GATs). We study three main relationships: \textit{spatial relationships} to explore geometric interactions, \textit{semantic relationships} to extract semantic interactions, and \textit{implicit relationships} to capture hidden information that cannot be modelled explicitly by the former two. We construct three relationship graphs, with the objects in an image as nodes and the pairwise relationships between objects as edges. By exploring the features of neighbouring regions individually via GATs, we integrate the different types of relationships into the visual features of each node. Experiments on the COCO dataset show that our proposed R-NBT model outperforms state-of-the-art models trained on the same dataset across three image caption generation tasks.
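The abstract does not spell out the layer-level computation, but a single relationship-based graph attention layer could look like the following minimal PyTorch sketch. All class names, dimensions, and the binary-adjacency edge encoding are assumptions for illustration; the paper's actual layer sizes and the fusion of the three graphs' outputs are not specified here.

```python
# A minimal sketch of one graph attention layer in the spirit of
# Velickovic et al.'s GAT, showing how per-node region features could be
# updated from neighbours along a single relationship graph (spatial,
# semantic, or implicit). Names and shapes are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationshipGATLayer(nn.Module):
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)   # shared projection
        self.a = nn.Linear(2 * out_dim, 1, bias=False)    # attention scorer

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (N, in_dim) visual features of the N detected object regions
        # adj: (N, N) binary adjacency of one relationship graph; assumed
        #      to include self-loops so every row has at least one edge
        h = self.W(x)                                     # (N, out_dim)
        N = h.size(0)
        # Pairwise concatenations [h_i || h_j] for attention coefficients.
        hi = h.unsqueeze(1).expand(N, N, -1)
        hj = h.unsqueeze(0).expand(N, N, -1)
        e = F.leaky_relu(self.a(torch.cat([hi, hj], dim=-1)).squeeze(-1))
        # Mask out non-edges before normalising over each node's neighbours.
        e = e.masked_fill(adj == 0, float('-inf'))
        alpha = torch.softmax(e, dim=-1)                  # (N, N)
        return F.elu(alpha @ h)                           # relation-aware features
```

In this sketch, three such layers (one per relationship graph) would run in parallel over the same region features, with their outputs combined per node before caption decoding; the abstract leaves the exact fusion scheme unspecified.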