Existing image captioning methods focus solely on understanding the relationships between objects or instances within a single image, without exploring the contextual correlations that exist among similar images. In this paper, we propose Dual Graph Convolutional Networks (Dual-GCN) with transformer and curriculum learning for image captioning. In particular, we not only use an object-level GCN to capture object-to-object spatial relations within a single image, but also adopt an image-level GCN to capture the feature information provided by similar images. With the well-designed Dual-GCN, the linguistic transformer can better understand the relationships between different objects in a single image and make full use of similar images as auxiliary information to generate a reasonable caption for a single image. Meanwhile, we introduce a cross-review strategy to determine difficulty levels and adopt curriculum learning as the training strategy, which improves the robustness and generalization of our proposed model. We conduct extensive experiments on the large-scale MS COCO dataset, and the experimental results demonstrate that our proposed method outperforms recent state-of-the-art approaches, achieving a BLEU-1 score of 82.2 and a BLEU-2 score of 67.6. Our source code is available at {\em \color{magenta}{\url{https://github.com/Unbear430/DGCN-for-image-captioning}}}.
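To make the Dual-GCN design concrete, the following is a minimal PyTorch sketch (not the released implementation; see the repository linked above for the authors' code). It pairs an object-level GCN over region features with an image-level GCN over features of similar images, then fuses the two streams into inputs for the linguistic transformer. All names (\texttt{GCNLayer}, \texttt{DualGCN}, the fusion layer) and the mean-pooling fusion are illustrative assumptions.

\begin{verbatim}
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution step: H' = ReLU(A @ H @ W)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, feats, adj):
        # feats: (B, N, in_dim); adj: (B, N, N), row-normalized.
        return torch.relu(self.linear(torch.bmm(adj, feats)))

class DualGCN(nn.Module):
    """Object-level GCN over region features plus image-level GCN over
    similar-image features, fused into transformer inputs (illustrative)."""
    def __init__(self, dim=512):
        super().__init__()
        self.object_gcn = GCNLayer(dim, dim)
        self.image_gcn = GCNLayer(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, obj_feats, obj_adj, img_feats, img_adj):
        obj_out = self.object_gcn(obj_feats, obj_adj)   # (B, N_obj, D)
        img_out = self.image_gcn(img_feats, img_adj)    # (B, N_img, D)
        # Pool the image-level context and broadcast it to each object node.
        ctx = img_out.mean(dim=1, keepdim=True).expand(-1, obj_out.size(1), -1)
        return self.fuse(torch.cat([obj_out, ctx], dim=-1))

# Smoke test with random region features, similar-image features, and
# softmax-normalized adjacency matrices.
B, N_obj, N_img, D = 2, 36, 5, 512
out = DualGCN(D)(
    torch.randn(B, N_obj, D), torch.softmax(torch.randn(B, N_obj, N_obj), -1),
    torch.randn(B, N_img, D), torch.softmax(torch.randn(B, N_img, N_img), -1))
print(out.shape)  # torch.Size([2, 36, 512])
\end{verbatim}

In this sketch the pooled image-level context is simply broadcast to every object node before fusion; the actual fusion used by the model may differ.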
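The cross-review curriculum described above can likewise be sketched. In this toy example the difficulty scores are random placeholders; in the paper's setting they would come from the cross-review strategy, with peer models scoring each training sample. The phase schedule and bucket fractions below are illustrative assumptions, not the authors' settings.

\begin{verbatim}
import torch
from torch.utils.data import DataLoader, Subset, TensorDataset

# Toy stand-in for (image features, caption tokens) training pairs.
data = TensorDataset(torch.randn(100, 512),
                     torch.randint(0, 1000, (100, 20)))

# Stand-in difficulty scores; in the paper's setting these would come from
# the cross-review strategy rather than random sampling.
difficulty = torch.rand(100)

order = torch.argsort(difficulty)  # easiest samples first
for phase, frac in enumerate([0.25, 0.5, 0.75, 1.0], start=1):
    # Each phase trains on a larger, progressively harder slice of the data.
    subset = Subset(data, order[: int(frac * len(data))].tolist())
    loader = DataLoader(subset, batch_size=16, shuffle=True)
    for feats, captions in loader:
        pass  # one optimization step of the captioning model goes here
    print(f"phase {phase}: {len(subset)} easiest samples")
\end{verbatim}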