Automatic image captioning has improved significantly over the last few years, but the problem is far from solved, with state-of-the-art models still often producing low-quality captions when used in the wild. In this paper, we focus on the task of Quality Estimation (QE) for image captions, which aims to model caption quality from a human perspective and without access to ground-truth references, so that it can be applied at prediction time to detect low-quality captions produced on previously unseen images. For this task, we develop a human evaluation process that collects coarse-grained caption annotations from crowdsourced users, which we then use to build a large-scale dataset spanning more than 600k caption quality ratings. We then carefully validate the quality of the collected ratings and establish baseline models for this new QE task. Finally, we further collect fine-grained caption quality annotations from trained raters, and use them to demonstrate that QE models trained over the coarse ratings can effectively detect and filter out low-quality image captions, thereby improving the user experience of captioning systems.
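To make the intended prediction-time usage concrete, the sketch below shows how a trained QE model could be applied to filter generated captions before they reach users. This is a minimal illustration, not the paper's implementation: the scoring function `qe_score` and the quality `threshold` are hypothetical stand-ins for a trained QE model and its tuned cutoff.

```python
# Minimal sketch (assumptions, not the paper's code) of QE-based caption filtering:
# a hypothetical trained QE model scores each (image, caption) pair without any
# ground-truth reference, and captions below an assumed quality threshold are dropped.

from typing import Callable, List, Tuple


def filter_captions(
    pairs: List[Tuple[str, str]],           # (image_id, generated caption)
    qe_score: Callable[[str, str], float],  # hypothetical QE model: returns quality in [0, 1]
    threshold: float = 0.5,                 # assumed quality cutoff, tuned on held-out ratings
) -> List[Tuple[str, str]]:
    """Keep only captions whose estimated quality clears the threshold."""
    return [(img, cap) for img, cap in pairs if qe_score(img, cap) >= threshold]


if __name__ == "__main__":
    # Dummy scorer for illustration only: a real QE model would be trained on
    # human quality ratings and consume image features rather than image ids.
    dummy_scorer = lambda img, cap: 0.9 if len(cap.split()) > 3 else 0.2

    candidates = [
        ("img_001", "a dog running on the beach"),
        ("img_002", "a photo"),
    ]
    print(filter_captions(candidates, dummy_scorer))
```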