In this paper we propose a new conditional GAN for image captioning that enforces semantic alignment between images and captions through a co-attentive discriminator and a context-aware LSTM sequence generator. To train these sequence GANs, we empirically study two algorithms: Self-critical Sequence Training (SCST) and Gumbel Straight-Through. We find both techniques viable for training sequence GANs. However, SCST displays better gradient behavior despite not directly leveraging gradients from the discriminator. This makes sequence GAN training more stable and ultimately produces models with improved results under human evaluation. Automatic evaluation of GAN-trained captioning models remains an open question. To address this, we introduce a new semantic score that correlates strongly with human judgement. As an evaluation paradigm, we suggest that a captioner's ability to describe Out of Context (OOC) scenes is an important criterion for assessing generalization and composition. To this end, we propose an OOC dataset which, combined with our automatic semantic score, forms a new benchmark for the captioning community to measure the generalization ability of automatic image captioning. Under this new OOC benchmark, and on the traditional MSCOCO dataset, our models trained with SCST perform strongly in both semantic score and human evaluation.
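To make the contrast between the two training algorithms concrete, the following is a minimal single-token sketch in plain Python, not the implementation used in this work. It assumes a hypothetical scalar `reward` function standing in for the discriminator score, and a hypothetical `upstream` vector standing in for the discriminator's gradient with respect to the one-hot token. All function names are illustrative. SCST uses the reward only as a black-box scalar, with the greedy decode as baseline, while Gumbel Straight-Through emits a hard one-hot sample in the forward pass and backpropagates through the tempered softmax.

```python
import math
import random


def softmax(logits, tau=1.0):
    """Numerically stable softmax with temperature tau."""
    m = max(logits)
    exps = [math.exp((x - m) / tau) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def scst_grad(logits, reward, rng):
    """Gradient of the SCST loss w.r.t. the logits for one token step.

    `reward` is a hypothetical black-box scalar score (an assumption here:
    it stands in for the discriminator output). No gradient of `reward`
    itself is needed.
    """
    n = len(logits)
    probs = softmax(logits)
    sampled = rng.choices(range(n), weights=probs)[0]
    greedy = max(range(n), key=lambda i: logits[i])
    # Self-critical baseline: reward of the greedy decode.
    advantage = reward(sampled) - reward(greedy)
    # d/dlogits of -advantage * log p(sampled) = -advantage * (onehot - probs)
    grad = [-advantage * ((1.0 if i == sampled else 0.0) - probs[i])
            for i in range(n)]
    return grad, sampled, greedy


def gumbel_st_sample(logits, tau, rng):
    """Forward pass: hard one-hot sample plus its soft relaxation."""
    n = len(logits)
    # Gumbel(0, 1) noise; clamp u away from 0 to avoid log(0).
    gumbels = [-math.log(-math.log(max(rng.random(), 1e-12)))
               for _ in range(n)]
    perturbed = [l + g for l, g in zip(logits, gumbels)]
    soft = softmax(perturbed, tau)
    hard_idx = max(range(n), key=lambda i: perturbed[i])
    hard = [1.0 if i == hard_idx else 0.0 for i in range(n)]
    return hard, soft


def gumbel_st_grad(soft, tau, upstream):
    """Backward pass of the straight-through estimator.

    `upstream` is the (assumed) gradient of the loss w.r.t. the one-hot
    token. It is pushed through the Jacobian of the soft relaxation:
    d soft_i / d logit_j = soft_i * (delta_ij - soft_j) / tau.
    """
    dot = sum(s * u for s, u in zip(soft, upstream))
    return [s * (u - dot) / tau for s, u in zip(soft, upstream)]


if __name__ == "__main__":
    rng = random.Random(0)
    logits = [2.0, 0.5, -1.0]
    grad, sampled, greedy = scst_grad(
        logits, lambda tok: 1.0 if tok == 1 else 0.0, rng)
    print("SCST grad:", grad, "sampled:", sampled, "greedy:", greedy)
    hard, soft = gumbel_st_sample(logits, tau=0.5, rng=rng)
    print("Gumbel-ST grad:", gumbel_st_grad(soft, 0.5, [0.1, 0.9, -0.3]))
```

In this sketch the SCST gradient vanishes whenever the sampled token matches the greedy one, since the advantage is zero, while the Gumbel Straight-Through gradient scales with `1 / tau` and depends directly on the upstream signal.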