Linguistic style is an essential part of written communication, with the power to affect both clarity and attractiveness. With recent advances in vision and language, we can start to tackle the problem of generating image captions that are both visually grounded and appropriately styled. Existing approaches either require styled training captions aligned to images or generate captions with low relevance. We develop a model that learns to generate visually relevant styled captions from a large corpus of styled text without aligned images. The core idea of this model, called SemStyle, is to separate semantics and style. One key component is a novel and concise semantic term representation generated using natural language processing techniques and frame semantics. In addition, we develop a unified language model that decodes sentences with diverse word choices and syntax for different styles. Evaluations, both automatic and manual, show that captions from SemStyle preserve image semantics, are descriptive, and are style shifted. More broadly, this work opens up the possibility of learning richer image descriptions from the plethora of linguistic data available on the web.
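To make the separation of semantics and style concrete, the sketch below illustrates the two stages the abstract describes: a caption is first reduced to a compact list of semantic terms (content-word lemmas, with verbs abstracted to coarse frame-like labels), and those terms are then re-rendered under a style token. The tiny lexicon, frame labels, style tokens, and output templates here are invented for illustration; they stand in for the paper's tagger, frame lexicon, and trained unified language model, and are not the SemStyle implementation itself.

```python
# Toy sketch of the SemStyle idea: strip style out of a caption into
# "semantic terms", then decode those terms under a chosen style token.
# All vocabularies and templates below are illustrative assumptions.

CONTENT_POS = {"NOUN", "VERB", "ADJ"}

# Hand-written lexicon: word -> (lemma, part of speech). A real system
# would use a POS tagger and lemmatizer over a full vocabulary.
LEXICON = {
    "a": ("a", "DET"), "the": ("the", "DET"), "is": ("be", "AUX"),
    "man": ("man", "NOUN"), "dog": ("dog", "NOUN"), "beach": ("beach", "NOUN"),
    "walking": ("walk", "VERB"), "running": ("run", "VERB"),
    "on": ("on", "ADP"), "sandy": ("sandy", "ADJ"),
}

# Coarse frame-like categories for verbs, standing in for frame semantics.
VERB_FRAMES = {"walk": "SELF_MOTION", "run": "SELF_MOTION"}


def extract_terms(caption: str) -> list[str]:
    """Map a descriptive caption to an ordered list of semantic terms:
    content-word lemmas, with verbs abstracted to a frame label."""
    terms = []
    for word in caption.lower().split():
        lemma, pos = LEXICON.get(word, (word, "X"))
        if pos not in CONTENT_POS:
            continue  # drop function words, where much of the style lives
        if pos == "VERB":
            terms.append(VERB_FRAMES.get(lemma, lemma).upper())
        else:
            terms.append(lemma)
    return terms


def decode(terms: list[str], style: str) -> str:
    """Render semantic terms back into a sentence under a style token.
    Two hand-written templates stand in for the trained language model
    that chooses words and syntax per style."""
    content = [t for t in terms if t.islower()]  # lemmas (frame labels are uppercase)
    if style == "DESCRIPTIVE":
        return f"a {content[0]} and a {content[1]} on the {content[-1]}."
    if style == "STORY":
        return f"the {content[0]} wandered along the {content[-1]} with his {content[1]}."
    raise ValueError(f"unknown style token: {style}")


if __name__ == "__main__":
    caption = "a man walking a dog on the sandy beach"
    terms = extract_terms(caption)
    print(terms)                         # ['man', 'SELF_MOTION', 'dog', 'sandy', 'beach']
    print(decode(terms, "DESCRIPTIVE"))  # factual rendering of the same terms
    print(decode(terms, "STORY"))        # style-shifted rendering of the same terms
```

Because the term representation carries only the image semantics, the same terms can be decoded into either register; in the full model the hand-written templates are replaced by a single language model trained on unaligned styled text and conditioned on a style token.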