Existing image captioning approaches typically train a one-stage sentence decoder, which struggles to generate rich, fine-grained descriptions. Multi-stage image captioning models, on the other hand, are hard to train because of the vanishing gradient problem. In this paper, we propose a coarse-to-fine multi-stage prediction framework for image captioning, composed of multiple decoders, each of which operates on the output of the previous stage and produces an increasingly refined image description. Our learning approach addresses vanishing gradients during training through an objective function that enforces intermediate supervision. In particular, we optimize our model with a reinforcement learning approach that normalizes the rewards using the output of each intermediate decoder's test-time inference algorithm together with the output of its preceding decoder, which simultaneously addresses the well-known exposure bias problem and the loss-evaluation mismatch problem. We extensively evaluate the proposed approach on MSCOCO and show that it achieves state-of-the-art performance.
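The reward normalization described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `stage_advantages` and `multi_stage_loss` are hypothetical, the rewards are assumed to be sentence-level scores such as CIDEr, and averaging the two baseline rewards is an assumption made only for illustration, since the abstract does not state how the two outputs are combined.

```python
from typing import List


def stage_advantages(sampled_rewards: List[float],
                     greedy_rewards: List[float]) -> List[float]:
    """Normalize each stage's sampled-caption reward with a baseline.

    sampled_rewards[i]: reward of the caption sampled from decoder i.
    greedy_rewards[i]:  reward of the caption produced by decoder i's
                        test-time (greedy) inference.
    """
    advantages = []
    for i, r_sample in enumerate(sampled_rewards):
        baseline = greedy_rewards[i]
        if i > 0:
            # ASSUMPTION: combine the current and preceding decoders'
            # test-time rewards by a simple average.
            baseline = 0.5 * (greedy_rewards[i] + greedy_rewards[i - 1])
        advantages.append(r_sample - baseline)
    return advantages


def multi_stage_loss(log_probs: List[float],
                     advantages: List[float]) -> float:
    """REINFORCE-style loss summed over stages.

    log_probs[i] is the log-probability of the sampled caption under
    decoder i. Every stage contributes its own term, which is what
    gives each decoder a direct (intermediate) training signal.
    """
    return -sum(a * lp for a, lp in zip(advantages, log_probs))


if __name__ == "__main__":
    adv = stage_advantages([0.8, 1.0, 1.2], [0.7, 0.9, 1.1])
    print(adv)
    print(multi_stage_loss([-1.0, -2.0, -3.0], adv))
```

Because each stage has its own reward term, the gradient for an early decoder does not have to flow back through the later ones, and because the reward is the evaluation metric computed on sampled captions, training and test conditions are aligned.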