In this paper, we propose a novel conditional generative adversarial nets based image captioning framework as an extension of the traditional reinforcement learning (RL) based encoder-decoder architecture. To deal with the inconsistency between objective language metrics and subjective human judgments, we design "discriminator" networks that automatically and progressively determine whether a generated caption is human-described or machine-generated. We introduce two discriminator architectures (CNN-based and RNN-based), since each has its own advantages. The proposed algorithm is generic, so it can enhance any existing encoder-decoder based image captioning model, and we show that the conventional RL training method is a special case of our framework. Empirically, we demonstrate consistent improvements across all language evaluation metrics for several state-of-the-art image captioning models.
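To make the "RL as a special case" claim concrete, the following is a minimal Python sketch of one common way such a framework can mix the discriminator's human-likeness probability with a language-metric score as the policy-gradient reward; the mixing weight `lam`, the self-critical baseline, and all numeric values are illustrative assumptions, not taken verbatim from the paper.

```python
import numpy as np

def combined_reward(disc_prob, metric_score, lam=0.3):
    # Mixed reward: lam * discriminator probability that the caption is
    # human-described + (1 - lam) * language-metric score (e.g. CIDEr).
    # Setting lam = 0 drops the discriminator term and recovers
    # conventional metric-driven RL training, which is the sense in
    # which RL is a special case of this framework. `lam` is a
    # hypothetical hyperparameter chosen here for illustration.
    return lam * disc_prob + (1.0 - lam) * metric_score

def scst_advantage(sampled_reward, greedy_reward):
    # Self-critical baseline: advantage of a sampled caption over the
    # caption decoded greedily by the current model.
    return sampled_reward - greedy_reward

# Illustrative numbers only.
r_sample = combined_reward(disc_prob=0.62, metric_score=0.95)
r_greedy = combined_reward(disc_prob=0.55, metric_score=1.02)
print(scst_advantage(r_sample, r_greedy))  # negative -> discourage the sample
```

Under this formulation, any encoder-decoder captioner already trained with metric-based RL can adopt the adversarial signal simply by raising `lam` above zero, which is one way to read the claim that the framework generalizes existing RL training.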