Responding with images has been recognized as an important capability of an intelligent conversational agent. Yet existing work focuses only on multimodal dialogue models that rely on retrieval-based methods, neglecting generation methods. To fill this gap, we first present a multimodal dialogue generation model that takes the dialogue history as input and generates a textual sequence or an image as the response. Learning such a model often requires multimodal dialogues containing both texts and images, which are difficult to obtain. Motivated by this practical challenge, we consider multimodal dialogue generation under the natural assumption that only limited training examples are available. In this low-resource setting, we devise a novel conversational agent, Divter, which isolates the parameters that depend on multimodal dialogues from the rest of the generation model. In this way, the major part of the model can be learned from a large number of text-only dialogues and text-image pairs respectively, and then the full set of parameters can be well fitted using the limited training examples. Extensive experiments demonstrate that our method achieves state-of-the-art results in both automatic and human evaluation, and can generate informative text responses and high-resolution image responses.
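The low-resource recipe above can be sketched as a two-stage pipeline: pre-train the text-dialogue and text-to-image components separately on abundant unimodal data, then fit the few multimodal-specific parameters on the small multimodal dialogue set. This is a minimal illustrative sketch, not the paper's actual implementation; every function and parameter name here is hypothetical.

```python
# Hedged toy sketch of the decoupled training scheme described in the
# abstract. Parameters tied to multimodal dialogues are isolated so the
# bulk of the model can be learned from unimodal data first.
# All names are illustrative assumptions, not the Divter API.

def pretrain_text_dialogue(text_dialogues):
    """Stage 1a: learn the textual dialogue component from text-only dialogues."""
    return {"text_dialogue_params": f"fit on {len(text_dialogues)} text dialogues"}

def pretrain_text_to_image(text_image_pairs):
    """Stage 1b: learn the description-to-image generator from <text, image> pairs."""
    return {"image_gen_params": f"fit on {len(text_image_pairs)} pairs"}

def finetune_joint(model, multimodal_dialogues):
    """Stage 2: fit only the multimodal-specific ('bridge') parameters on the
    scarce multimodal dialogues, keeping the pre-trained parts as initialization."""
    model["bridge_params"] = f"tuned on {len(multimodal_dialogues)} examples"
    return model

# Abundant unimodal data vs. scarce multimodal dialogues (toy counts).
model = {}
model.update(pretrain_text_dialogue(["dialogue"] * 1_000_000))
model.update(pretrain_text_to_image([("caption", "image")] * 500_000))
model = finetune_joint(model, [("history", "image_response")] * 100)
print(sorted(model))  # ['bridge_params', 'image_gen_params', 'text_dialogue_params']
```

The key design choice the sketch mirrors is that only `bridge_params` ever sees multimodal dialogues, so the data-hungry components never depend on the scarce resource.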