Current image captioning methods are usually trained via (penalized) maximum likelihood estimation. However, the log-likelihood score of a caption does not correlate well with human assessments of quality. Standard syntactic evaluation metrics, such as BLEU, METEOR and ROUGE, are also not well correlated with human judgments. The newer SPICE and CIDEr metrics are better correlated, but have traditionally been hard to optimize for. In this paper, we show how to use a policy gradient (PG) method to directly optimize a linear combination of SPICE and CIDEr (a combination we call SPIDEr): the SPICE score ensures our captions are semantically faithful to the image, while the CIDEr score ensures our captions are syntactically fluent. The PG method we propose improves on the prior MIXER approach by using Monte Carlo rollouts instead of mixing MLE training with PG. We show empirically that our algorithm leads to easier optimization and improved results compared to MIXER. Finally, we show that our PG method can optimize any of these metrics, including the proposed SPIDEr metric, and that optimizing SPIDEr yields image captions that human raters strongly prefer over captions generated by the same model trained to optimize MLE or the COCO metrics.
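To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It shows a reward formed as a linear combination of SPICE and CIDEr, and a per-token value estimate obtained by completing each partial caption several times with the current policy (Monte Carlo rollouts). The equal weights, the function names, the uniform toy policy, and the word-overlap reward that stands in for real SPICE and CIDEr scorers are all assumptions made for illustration.

```python
import random

EOS = "<eos>"

def spider(spice, cider, w_spice=0.5, w_cider=0.5):
    """Linear combination of SPICE and CIDEr. Equal weights are an assumption."""
    return w_spice * spice + w_cider * cider

def rollout_value(prefix, sample_next, reward, max_len, num_rollouts, rng):
    """Estimate the expected final reward of a partial caption by completing
    it num_rollouts times with the policy and averaging the rewards."""
    total = 0.0
    for _ in range(num_rollouts):
        seq = list(prefix)
        # Keep sampling until the caption ends or reaches the length limit.
        while len(seq) < max_len and (not seq or seq[-1] != EOS):
            seq.append(sample_next(seq, rng))
        total += reward(seq)
    return total / num_rollouts

def per_token_values(caption, sample_next, reward, max_len, num_rollouts, rng):
    """One rollout-based value estimate per generated token."""
    return [
        rollout_value(caption[: t + 1], sample_next, reward,
                      max_len, num_rollouts, rng)
        for t in range(len(caption))
    ]

# Hypothetical toy setup: a uniform policy and a word-overlap reward.
VOCAB = ["a", "dog", "runs", EOS]
REFERENCE = {"a", "dog", "runs"}

def toy_sample_next(seq, rng):
    """Stand-in policy: ignores the prefix and samples uniformly."""
    return rng.choice(VOCAB)

def toy_reward(seq):
    """Stand-in for real metric scorers: fraction of reference words covered."""
    words = {w for w in seq if w != EOS}
    overlap = len(words & REFERENCE) / len(REFERENCE)
    return spider(overlap, overlap)

rng = random.Random(0)
caption = ["a", "dog", "runs", EOS]
values = per_token_values(caption, toy_sample_next, toy_reward,
                          max_len=4, num_rollouts=20, rng=rng)
```

In a policy gradient update of this kind, each entry of `values` would typically scale the gradient of the log-probability of the corresponding token, often after subtracting a baseline. That step is omitted here because it depends on the captioning model.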