Prophet Attention: Predicting Attention with Future Attention for Image Captioning

Recently, attention-based models have been used extensively in many sequence-to-sequence learning systems. In image captioning especially, attention-based models are expected to ground the correct image regions with the properly generated words. However, at each time step of the decoding process, attention-based models usually use the hidden state of the current input to attend to the image regions. Under this setting, these attention models have a "deviated focus" problem: they calculate the attention weights based on previous words instead of the one to be generated, impairing the performance of both grounding and captioning. In this paper, we propose Prophet Attention, which is similar in form to self-supervision. In the training stage, this module utilizes future information to calculate the "ideal" attention weights towards image regions. These calculated "ideal" weights are further used to regularize the "deviated" attention. In this manner, image regions are grounded with the correct words. The proposed Prophet Attention can be easily incorporated into existing image captioning models to improve their performance on both grounding and captioning. Experiments on the Flickr30k Entities and MSCOCO datasets show that the proposed Prophet Attention consistently outperforms baselines in both automatic metrics and human evaluations. It is worth noting that we set a new state of the art on the two benchmark datasets and achieve 1st place on the leaderboard of the online MSCOCO benchmark in terms of the default ranking score, i.e., CIDEr-c40.

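In summary: attention-based captioning models are expected to link each generated word to the right image regions, yet at every decoding step they typically attend to the image using the hidden state of the current input. The attention weights therefore depend on the words already generated rather than on the word about to be generated, a "deviated focus" that hurts both grounding and captioning. Prophet Attention addresses this in a manner similar to self-supervision: during training, the module uses future information to compute "ideal" attention weights over the image regions, then uses those weights to regularize the "deviated" attention so that regions are grounded with the correct words. It can be added to existing captioning models, and on Flickr30k Entities and MSCOCO it consistently outperforms the baselines in both automatic metrics and human evaluation. The authors report new state-of-the-art results on both datasets and first place on the online MSCOCO leaderboard under the default ranking score, CIDEr-c40.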
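
To make the training-time idea concrete, here is a minimal sketch of one decoding step in plain Python. It is not the authors' implementation. The abstract does not say how attention is scored or how the "ideal" weights regularize the "deviated" ones, so the sketch assumes dot-product attention and an L1 distance between the two weight vectors. The helper names (`attend`, `prophet_regularizer`) and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, regions):
    """Attention weights of one query vector over the image regions.

    Assumption: plain dot-product scoring; the abstract does not specify it.
    """
    scores = [sum(q * r for q, r in zip(query, region)) for region in regions]
    return softmax(scores)


def prophet_regularizer(deviated, ideal):
    """Assumed regularizer: L1 distance between two attention distributions."""
    return sum(abs(d - i) for d, i in zip(deviated, ideal))


# Toy data (hypothetical): three image regions with 2-d features.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_current = [2.0, 0.0]  # hidden state of the current input (previous word)
y_future = [0.0, 2.0]   # representation of the word to be generated

# "Deviated" attention: computed from the current input, as at inference time.
deviated = attend(h_current, regions)
# "Ideal" attention: computed from future information, training only.
ideal = attend(y_future, regions)

# Extra training term that pulls the deviated weights towards the ideal ones.
loss = prophet_regularizer(deviated, ideal)
print(deviated, ideal, loss)
```

In this sketch the regularizer would be added to the usual captioning loss during training only. At inference the future word is unknown, so only the `deviated` branch runs.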