Existing approaches to image captioning usually generate the sentence word by word from left to right, conditioned only on local context, i.e., the given image and previously generated words. Many studies have sought to exploit global information during decoding, e.g., via iterative refinement. However, how to incorporate future context effectively and efficiently remains under-explored. To address this issue, inspired by the fact that Non-Autoregressive Image Captioning (NAIC) can leverage bidirectional relations through a modified mask operation, we aim to graft this advantage onto the conventional Autoregressive Image Captioning (AIC) model while maintaining inference efficiency, i.e., without extra time cost. Specifically, the AIC and NAIC models are first trained jointly with a shared visual encoder, forcing the visual encoder to contain sufficient and valid future context; the AIC model is then encouraged to capture the causal dynamics of cross-layer interaction from the NAIC model on its unconfident words, following a teacher-student paradigm optimized with a distribution-calibration training objective. Empirical evidence demonstrates that our proposed approach clearly surpasses state-of-the-art baselines in both automatic metrics and human evaluations on the MS COCO benchmark. The source code is available at: https://github.com/feizc/Future-Caption.
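The teacher-student step described above can be illustrated with a minimal sketch: the autoregressive student distils from the non-autoregressive teacher only at positions where the student is unconfident, using a KL-divergence penalty as a stand-in for the distribution-calibration objective. All names and the confidence threshold here are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def calibration_loss(student_probs, teacher_probs, threshold=0.5):
    """Sketch of distribution calibration on unconfident words.

    student_probs / teacher_probs: per-position vocabulary distributions
    from the AIC (student) and NAIC (teacher) models, respectively.
    A position counts as "unconfident" when the student's top probability
    falls below `threshold` (an assumed hyperparameter); only those
    positions contribute a KL(teacher || student) term.
    """
    losses = []
    for s, t in zip(student_probs, teacher_probs):
        s = np.asarray(s, dtype=float)
        t = np.asarray(t, dtype=float)
        if s.max() < threshold:  # unconfident word: distil from the teacher
            kl = np.sum(t * (np.log(t + 1e-9) - np.log(s + 1e-9)))
            losses.append(float(kl))
    # Average over penalised positions (0.0 if the student was always confident)
    return sum(losses) / max(len(losses), 1)

# Toy usage: position 0 is confident (skipped), position 1 is distilled.
student = [[0.90, 0.05, 0.05], [0.40, 0.30, 0.30]]
teacher = [[0.90, 0.05, 0.05], [0.80, 0.10, 0.10]]
loss = calibration_loss(student, teacher)
```

In practice this term would be added to the usual cross-entropy loss, so confident predictions are left untouched while unconfident ones are pulled toward the future-aware teacher distribution.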