Image captioning has attracted ever-increasing research attention in the multimedia community. To this end, most cutting-edge works rely on an encoder-decoder framework with attention mechanisms and have achieved remarkable progress. However, such a framework does not exploit scene concepts when attending to visual information, which introduces sentence bias into caption generation and degrades performance accordingly. We argue that scene concepts capture higher-level visual semantics and serve as an important cue for describing images. In this paper, we propose a novel scene-based factored attention module for image captioning. Specifically, the proposed module first embeds the scene concepts into factored weights explicitly and attends to the visual information extracted from the input image. Then, an adaptive LSTM is used to generate captions for specific scene types. Experimental results on the Microsoft COCO benchmark show that the proposed scene-based attention module substantially improves model performance, outperforming state-of-the-art approaches under various evaluation metrics.
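To make the described pipeline concrete, the following is a minimal PyTorch sketch of how scene concepts could be embedded into factored weights that modulate attention over region features before an LSTM decoding step. All module names, layer sizes, the sigmoid-gated factorization, and the single-step decoder interface are assumptions for illustration only; they are not the authors' released implementation.

```python
# Illustrative sketch: scene-conditioned factored attention + one LSTM decode step.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SceneFactoredAttention(nn.Module):
    def __init__(self, feat_dim, hid_dim, scene_dim, att_dim):
        super().__init__()
        self.proj_v = nn.Linear(feat_dim, att_dim)   # project region features
        self.proj_h = nn.Linear(hid_dim, att_dim)    # project decoder hidden state
        self.proj_s = nn.Linear(scene_dim, att_dim)  # scene concept -> factored weights (assumed gating form)
        self.score = nn.Linear(att_dim, 1)

    def forward(self, regions, hidden, scene):
        # regions: (B, R, feat_dim), hidden: (B, hid_dim), scene: (B, scene_dim)
        v = self.proj_v(regions)                                # (B, R, att_dim)
        h = self.proj_h(hidden).unsqueeze(1)                    # (B, 1, att_dim)
        s = torch.sigmoid(self.proj_s(scene)).unsqueeze(1)      # scene-dependent factor, (B, 1, att_dim)
        # The scene factor multiplicatively re-weights the joint visual/state feature.
        e = self.score(torch.tanh(s * (v + h))).squeeze(-1)     # (B, R)
        alpha = F.softmax(e, dim=-1)                            # attention over regions
        context = torch.bmm(alpha.unsqueeze(1), regions).squeeze(1)  # (B, feat_dim)
        return context, alpha


class CaptionDecoderStep(nn.Module):
    """One decoding step: attend with the scene factor, then update the LSTM."""
    def __init__(self, feat_dim, hid_dim, scene_dim, att_dim, emb_dim, vocab_size):
        super().__init__()
        self.att = SceneFactoredAttention(feat_dim, hid_dim, scene_dim, att_dim)
        self.lstm = nn.LSTMCell(emb_dim + feat_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, word_emb, state, regions, scene):
        h, c = state
        context, alpha = self.att(regions, h, scene)
        h, c = self.lstm(torch.cat([word_emb, context], dim=-1), (h, c))
        return self.out(h), (h, c), alpha                       # next-word logits, new state, weights


if __name__ == "__main__":
    B, R = 2, 36
    step = CaptionDecoderStep(feat_dim=2048, hid_dim=512, scene_dim=365,
                              att_dim=512, emb_dim=300, vocab_size=10000)
    regions = torch.randn(B, R, 2048)     # e.g. pre-extracted region features
    scene = torch.randn(B, 365)           # e.g. scene-classifier posteriors (hypothetical input)
    state = (torch.zeros(B, 512), torch.zeros(B, 512))
    logits, state, alpha = step(torch.randn(B, 300), state, regions, scene)
    print(logits.shape, alpha.shape)      # torch.Size([2, 10000]) torch.Size([2, 36])
```

In this sketch the scene embedding acts as a gate on the attention features, which is one plausible reading of "embedding scene concepts into factored weights"; a full model would run this step over the whole caption and condition the scene input on a scene classifier.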