Text is NOT Enough: Integrating Visual Impressions into Open-domain Dialogue Generation

Open-domain dialogue generation in natural language processing (NLP) is by default a pure-language task, which aims to satisfy the human need for daily communication on open-ended topics by producing relevant and informative responses. In this paper, we point out that hidden images, which we call visual impressions (VIs), can be recovered from text-only data to enhance dialogue understanding and help generate better responses. Moreover, the semantic dependency between a dialogue post and its response is complicated: there are often few word alignments between the two, and the topic may shift. Their visual impressions are therefore not shared, and it is more reasonable to integrate the response visual impressions (RVIs) into the decoder, rather than the post visual impressions (PVIs). However, neither the response nor its RVIs are available at test time. To handle these issues, we propose a framework that explicitly constructs VIs from pure-language dialogue datasets and uses them for better dialogue understanding and generation. Specifically, we obtain a group of images (PVIs) for each post using a pre-trained word-image mapping model. These PVIs are fed to a co-attention encoder to produce a post representation that carries both visual and textual information. Since the RVIs are not provided directly during testing, we design a cascade decoder that consists of two sub-decoders. The first sub-decoder predicts the content words of the response and applies the word-image mapping model to obtain the RVIs. The second sub-decoder then generates the response based on the post and the RVIs. Experimental results on two open-domain dialogue datasets show that our proposed approach achieves superior performance over competitive baselines.
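The data flow of the cascade described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the lookup table `WORD_TO_IMAGE` is a hypothetical stand-in for the pre-trained word-image mapping model, the image file names are invented, and the two toy functions stand in for the learned sub-decoders. The co-attention encoder is omitted, so the PVIs are simply passed along with the post.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the pre-trained word-image mapping model:
# a plain lookup from a word to an (invented) image identifier.
WORD_TO_IMAGE: Dict[str, str] = {
    "beach": "img_beach.jpg",
    "dog": "img_dog.jpg",
    "swim": "img_swim.jpg",
}


def map_words_to_images(words: List[str]) -> List[str]:
    """Return the images for the words the mapping covers, in input order."""
    return [WORD_TO_IMAGE[w] for w in words if w in WORD_TO_IMAGE]


def cascade_generate(
    post: str,
    predict_content_words: Callable[[str, List[str]], List[str]],
    generate_response: Callable[[str, List[str], List[str]], str],
) -> Dict[str, object]:
    """Run the two-stage cascade: content words -> RVIs -> response."""
    # Post visual impressions (PVIs) come from the words of the post.
    pvis = map_words_to_images(post.lower().split())
    # Stage 1: the first sub-decoder predicts response content words.
    content_words = predict_content_words(post, pvis)
    # The same mapping turns those words into response visual impressions.
    rvis = map_words_to_images(content_words)
    # Stage 2: the second sub-decoder generates the response using the RVIs.
    response = generate_response(post, pvis, rvis)
    return {
        "pvis": pvis,
        "content_words": content_words,
        "rvis": rvis,
        "response": response,
    }


# Toy sub-decoders (assumptions for illustration only, not learned models).
def toy_predict_content_words(post: str, pvis: List[str]) -> List[str]:
    return ["beach", "swim"] if "img_dog.jpg" in pvis else ["dog"]


def toy_generate_response(post: str, pvis: List[str], rvis: List[str]) -> str:
    return f"(response conditioned on {len(pvis)} PVIs and {len(rvis)} RVIs)"


result = cascade_generate(
    "My dog loves water", toy_predict_content_words, toy_generate_response
)
```

The point of the sketch is the ordering: the RVIs never have to be supplied at test time, because they are derived from the content words that the first stage predicts.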
