We study how to generate captions that are not only accurate in describing an image but also discriminative across different images. The problem is both fundamental and interesting: despite phenomenal research progress in the past several years, most machine-generated captions are expressed in a monotonous, featureless style. While such captions are usually accurate, they often lack two important characteristics of human language: distinctiveness for each caption and diversity across different images. To address this problem, we propose a novel conditional generative adversarial network for generating diverse captions across images. Instead of estimating the quality of a caption solely for one image, the proposed comparative adversarial learning framework assesses caption quality by comparing a set of captions within a joint image-caption embedding space. By contrasting generated captions with human-written captions and image-mismatched captions, the caption generator effectively exploits the inherent characteristics of human language and produces more discriminative captions. We show that the proposed network is capable of producing accurate and diverse captions across images.
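To make the comparative assessment concrete, below is a minimal PyTorch sketch of a discriminator that scores captions relative to one another in a joint image-caption space, rather than judging each caption in isolation. The architecture, dimensions, and encoder choice (an LSTM over precomputed CNN image features) are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComparativeDiscriminator(nn.Module):
    """Scores each caption in a candidate set relative to the others for
    the same image, via similarities in a joint embedding space.

    Illustrative sketch: all dimensions and module choices are assumptions.
    """

    def __init__(self, image_dim=2048, vocab_size=10000, embed_dim=512):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, embed_dim)      # image -> joint space
        self.word_embed = nn.Embedding(vocab_size, embed_dim)  # tokens -> vectors
        self.caption_rnn = nn.LSTM(embed_dim, embed_dim, batch_first=True)

    def embed_captions(self, captions):
        # captions: (num_captions, seq_len) integer token ids
        words = self.word_embed(captions)
        _, (h, _) = self.caption_rnn(words)
        return h[-1]                                           # (num_captions, embed_dim)

    def forward(self, image_feat, captions, temperature=1.0):
        # image_feat: (image_dim,) precomputed CNN feature for a single image
        img = F.normalize(self.image_proj(image_feat), dim=-1)
        caps = F.normalize(self.embed_captions(captions), dim=-1)
        sims = caps @ img / temperature                        # cosine similarity per caption
        # Softmax over the caption set: each score is relative to the
        # other captions, which is the comparative (set-wise) assessment.
        return F.softmax(sims, dim=0)                          # (num_captions,)


# Hypothetical usage: the candidate set mixes a generated caption, a
# human-written caption, and a caption from a mismatched image.
disc = ComparativeDiscriminator()
image_feat = torch.randn(2048)
caption_set = torch.randint(0, 10000, (3, 16))  # [generated, human, mismatched]
scores = disc(image_feat, caption_set)
```

In an adversarial setup of this kind, the discriminator would be trained to assign high relative probability to the human-written caption, while the generator is updated to raise the relative score of its own caption, pushing it toward distinctive, image-specific language rather than safe, generic phrasing.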