Automatic image captioning has recently approached human-level performance thanks to the latest advances in computer vision and natural language understanding. However, most current models can only generate plain factual descriptions of a given image's content. For human beings, in contrast, caption writing is flexible and diverse: additional language dimensions, such as emotion, humor, and language style, are often incorporated to produce varied, emotional, or appealing captions. In particular, we are interested in generating sentiment-conveying image descriptions, a task that has received little attention. The main challenge is how to effectively inject sentiment into the generated captions without disrupting the semantic match between the visual content and the generated description. In this work, we propose two models that employ different schemes for injecting sentiment into image captions. Compared with the few existing approaches, the proposed models are much simpler yet more effective. Experimental results show that our models outperform the state-of-the-art in generating sentimental (i.e., sentiment-bearing) image captions. In addition, the models can easily be manipulated by assigning different sentiments to a test image, producing captions that convey the corresponding sentiment.