Prophet Attention: Predicting Attention with Future Attention for Improved Image Captioning

Attention-based models have recently been used extensively in many sequence-to-sequence learning systems. In image captioning in particular, attention-based models are expected to ground the correct image regions with the corresponding generated words. However, at each time step of the decoding process, these models usually use the hidden state of the current input to attend to the image regions. Under this setting, they suffer from a "deviated focus" problem: the attention weights are calculated from the previously generated words rather than from the word to be generated, which impairs both grounding and captioning performance. In this paper, we propose Prophet Attention, which works in a manner similar to self-supervision. In the training stage, this module uses future information to calculate the "ideal" attention weights over image regions. These "ideal" weights are then used to regularize the "deviated" attention, so that image regions are grounded with the correct words. Prophet Attention can be easily incorporated into existing image captioning models to improve both their grounding and their captioning performance. Experiments on the Flickr30k Entities and MSCOCO datasets show that Prophet Attention consistently outperforms the baselines in both automatic metrics and human evaluations. Notably, we set a new state of the art on both benchmark datasets and achieve 1st place on the leaderboard of the online MSCOCO benchmark in terms of the default ranking score, i.e., CIDEr-c40.
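
The training-time idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the dot-product scoring, the L1 penalty, and the names `attention_weights`, `prophet_regularizer`, `prev_state`, and `future_word` are assumptions chosen for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_weights(query, regions):
    """Dot-product attention of one query vector over region vectors (assumed scoring)."""
    return softmax([dot(query, r) for r in regions])


def prophet_regularizer(prev_state, future_word, regions):
    """Distance between the 'deviated' and the 'ideal' attention weights.

    prev_state:  query derived from previously generated words (hypothetical input)
    future_word: query derived from the word to be generated, known only in training
    The L1 distance used here is an assumption of this sketch.
    """
    deviated = attention_weights(prev_state, regions)
    ideal = attention_weights(future_word, regions)
    return sum(abs(d - i) for d, i in zip(deviated, ideal))


# Two toy image regions; the two queries point at different regions.
regions = [[1.0, 0.0], [0.0, 1.0]]
penalty = prophet_regularizer([2.0, 0.0], [0.0, 2.0], regions)
```

In this toy setup the penalty is zero when both queries attend to the regions identically and grows as the "deviated" weights drift away from the "ideal" ones. In training, such a term would be added to the usual captioning loss; at inference time the future word is not available, so only the regularized attention is used.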
