Image captioning has been shown to perform better when scene graphs are used to represent the relations between objects in the image. Current captioning encoders generally use a Graph Convolutional Network (GCN) to represent the relation information and merge it with the object region features via concatenation or convolution to produce the final input for sentence decoding. However, the GCN-based encoders in existing methods are less effective for captioning for two reasons. First, training with the captioning objective alone (i.e., Maximum Likelihood Estimation) rather than a relation-centric loss cannot fully exploit the potential of the encoder. Second, using a pre-trained model instead of the encoder itself to extract the relationships is inflexible and does not contribute to the explainability of the model. To improve the quality of image captioning, we propose a novel architecture, ReFormer (a RElational transFORMER), which generates features with relation information embedded and explicitly expresses the pair-wise relationships between objects in the image. ReFormer combines the objective of scene graph generation with that of image captioning in a single modified Transformer model. This design allows ReFormer to generate not only better image captions, thanks to strong relational image features, but also scene graphs that explicitly describe the pair-wise relationships. Experiments on publicly available datasets show that our model significantly outperforms state-of-the-art methods on image captioning and scene graph generation.
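The idea of training one model on both a captioning objective and a relation-centric objective can be illustrated with a small sketch. The abstract does not say how the two objectives are combined, so the weighted sum below is an assumption, and every name here (`cross_entropy`, `joint_loss`, `relation_weight`) is hypothetical rather than taken from ReFormer.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of class `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def mean_cross_entropy(logit_rows, targets):
    """Average cross-entropy over a non-empty batch of predictions."""
    if not logit_rows or len(logit_rows) != len(targets):
        raise ValueError("need one target per non-empty row of logits")
    losses = [cross_entropy(row, t) for row, t in zip(logit_rows, targets)]
    return sum(losses) / len(losses)


def joint_loss(caption_logits, caption_tokens,
               relation_logits, relation_labels,
               relation_weight=1.0):
    """Combine a caption MLE loss with a relation classification loss.

    ASSUMPTION: a weighted sum; the source does not specify the combination.
    caption_logits:  one row of vocabulary scores per caption token.
    relation_logits: one row of predicate scores per object pair.
    """
    caption_loss = mean_cross_entropy(caption_logits, caption_tokens)
    relation_loss = mean_cross_entropy(relation_logits, relation_labels)
    total = caption_loss + relation_weight * relation_loss
    return total, caption_loss, relation_loss


if __name__ == "__main__":
    # Toy inputs: 2 caption tokens over a 4-word vocabulary,
    # 1 object pair over 3 hypothetical predicate classes.
    total, cap, rel = joint_loss(
        caption_logits=[[2.0, 0.1, 0.1, 0.1], [0.1, 0.1, 3.0, 0.1]],
        caption_tokens=[0, 2],
        relation_logits=[[0.5, 1.5, 0.2]],
        relation_labels=[1],
        relation_weight=0.5,
    )
    print(f"caption={cap:.3f} relation={rel:.3f} total={total:.3f}")
```

In this sketch, setting `relation_weight` to zero recovers caption-only training, which corresponds to the first limitation the abstract attributes to existing GCN-based encoders.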