Existing research on image captioning usually represents an image as a scene graph of low-level facts (objects and relations) and fails to capture high-level semantics. In this paper, we propose a Theme Concepts extended Image Captioning (TCIC) framework that incorporates theme concepts to represent high-level cross-modality semantics. In practice, we model theme concepts as memory vectors and propose a Transformer with Theme Nodes (TTN) to incorporate those vectors for image captioning. Considering that theme concepts can be learned from both images and captions, we propose two settings for their representation learning based on TTN. On the vision side, TTN takes both scene-graph-based features and theme concepts as input for visual representation learning. On the language side, TTN takes both captions and theme concepts as input for text representation reconstruction. Both settings generate target captions with the same transformer-based decoder. During training, we further align the representations of theme concepts learned from images and their corresponding captions to enforce cross-modality learning. Experimental results on MS COCO show the effectiveness of our approach compared with state-of-the-art models.
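The core idea above — shared theme-concept memory vectors prepended to both the visual and textual input sequences, with their output representations aligned across modalities — can be sketched as follows. This is a minimal NumPy illustration under assumed names and sizes (`theme_memory`, `num_themes`, `d`, and the identity "encoder" are all hypothetical simplifications), not the authors' TTN implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 4 theme-concept nodes, feature dimension 8.
num_themes, d = 4, 8
# Theme concepts modeled as a shared bank of memory vectors.
theme_memory = rng.normal(size=(num_themes, d))

def ttn_input(features, themes):
    """Prepend theme-concept nodes to a feature sequence, mirroring how
    TTN takes both features and theme concepts as transformer input."""
    return np.concatenate([themes, features], axis=0)

# Vision side: scene-graph-based region features plus theme concepts.
region_feats = rng.normal(size=(10, d))
vis_seq = ttn_input(region_feats, theme_memory)

# Language side: caption token embeddings plus the same theme concepts.
caption_embs = rng.normal(size=(12, d))
txt_seq = ttn_input(caption_embs, theme_memory)

# Cross-modality alignment: pull the theme-node representations from the
# two sides together (identity encoder here, so the loss is trivially 0;
# in the real model these positions differ after self-attention).
align_loss = np.mean((vis_seq[:num_themes] - txt_seq[:num_themes]) ** 2)
```

In the full model, a transformer encoder would transform each sequence before the alignment loss is computed, so the theme positions carry modality-specific context rather than the raw shared memory shown here.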