Image captioning is one of the most challenging tasks in AI, aiming to automatically generate textual sentences for an image. Recent methods for image captioning follow the encoder-decoder framework, which transforms a sequence of salient regions in an image into a natural language description. However, these models usually lack a comprehensive understanding of the contextual interactions reflected in the various visual relationships between objects. In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning. Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate information from local neighbors. Implicitly, we capture global interactions among the detected objects through region-based bidirectional encoder representations from transformers (Region BERT) without extra relational annotations. To evaluate the effectiveness and superiority of the proposed method, we conduct extensive experiments on the Microsoft COCO benchmark and achieve remarkable improvements over strong baselines.
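To make the explicit branch concrete, the following is a minimal sketch, not the authors' exact implementation, of a gated graph-convolution step over detected object regions: each region mixes in information from its semantic-graph neighbors through a learned, per-dimension gate. The class name `GatedGCNLayer`, the mean-over-neighbors aggregation, and the random adjacency used in the usage example are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GatedGCNLayer(nn.Module):
    """Illustrative gated graph convolution over region features (assumed design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.neighbor_proj = nn.Linear(dim, dim)  # transform neighbor features
        self.gate = nn.Linear(2 * dim, dim)       # gate computed from (self, aggregated neighbors)

    def forward(self, regions: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # regions: (num_regions, dim) features of detected objects
        # adj:     (num_regions, num_regions) 0/1 adjacency of the semantic graph
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        neighbors = (adj @ self.neighbor_proj(regions)) / deg          # mean over graph neighbors
        gate = torch.sigmoid(self.gate(torch.cat([regions, neighbors], dim=-1)))
        return regions + gate * neighbors                              # selectively absorb neighbor info


# Usage sketch: 36 detected regions with 2048-d features and a random semantic graph.
feats = torch.randn(36, 2048)
adj = (torch.rand(36, 36) > 0.8).float()
out = GatedGCNLayer(2048)(feats, adj)  # -> (36, 2048) relation-aware region features
```

The gate lets each region decide, dimension by dimension, how much neighbor information to absorb, which matches the "selectively aggregate local neighbors' information" behavior described above; the implicit branch (Region BERT) would instead apply self-attention over all region features without requiring an explicit graph.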