It has long been believed that modeling the relationships between objects would help in representing and eventually describing an image. Nevertheless, there has been little evidence supporting this idea for image description generation. In this paper, we introduce a new design that explores the connections between objects for image captioning under the umbrella of an attention-based encoder-decoder framework. Specifically, we present a Graph Convolutional Networks plus Long Short-Term Memory (dubbed GCN-LSTM) architecture that integrates both semantic and spatial object relationships into the image encoder. Technically, we build graphs over the objects detected in an image based on their spatial and semantic connections. The representations of the regions proposed on these objects are then refined by leveraging the graph structure through GCN. With the learnt region-level features, our GCN-LSTM capitalizes on an LSTM-based captioning framework with an attention mechanism for sentence generation. Extensive experiments are conducted on the COCO image captioning dataset, and superior results are reported compared to state-of-the-art approaches. More remarkably, GCN-LSTM increases the CIDEr-D performance from 120.1% to 128.7% on the COCO testing set.
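As a rough illustration of the encoding step described above (not the authors' implementation), the sketch below shows how detector region features could be refined with a single graph-convolution layer over a relation graph before being handed to an attention-based LSTM decoder. The feature dimensions, the `RegionGCN` module name, and the way the adjacency matrix is produced are all assumptions made for the example; the paper builds separate spatial and semantic graphs from detected relationships.

```python
# Minimal sketch of relation-aware region encoding with a GCN layer.
# Assumptions: 2048-d detector features, a precomputed 0/1 relation adjacency
# matrix, and a single mean-aggregation graph convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RegionGCN(nn.Module):
    """One graph-convolution layer over detected object regions (illustrative)."""

    def __init__(self, feat_dim: int = 2048, out_dim: int = 1024):
        super().__init__()
        self.neigh_proj = nn.Linear(feat_dim, out_dim)  # transform neighbour features
        self.self_proj = nn.Linear(feat_dim, out_dim)   # transform the region itself

    def forward(self, regions: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # regions: (num_regions, feat_dim) features from an object detector
        # adj:     (num_regions, num_regions) relation graph (spatial or semantic),
        #          assumed to be built beforehand
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)        # avoid division by zero
        neighbour_msg = self.neigh_proj(adj @ regions) / deg      # mean over related regions
        return F.relu(self.self_proj(regions) + neighbour_msg)   # relation-aware features


if __name__ == "__main__":
    num_regions = 5
    feats = torch.randn(num_regions, 2048)                   # e.g. region proposal features
    adj = (torch.rand(num_regions, num_regions) > 0.5).float()
    refined = RegionGCN()(feats, adj)
    print(refined.shape)  # torch.Size([5, 1024]); inputs to the attention LSTM decoder
```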