Intelligent dialogue systems, which aim to communicate with humans harmoniously in natural language, are important for advancing human-machine interaction in the era of artificial intelligence. As human-computer interaction requirements grow more complex (e.g., multimodal inputs, time sensitivity), traditional text-based dialogue systems struggle to meet the demand for more vivid and convenient interaction. Consequently, the Visual Context Augmented Dialogue System (VAD), which has the potential to communicate with humans by perceiving and understanding multimodal information (i.e., visual context in images or videos, and textual dialogue history), has become a predominant research paradigm. Benefiting from the consistency and complementarity between visual and textual context, VAD can potentially generate engaging and context-aware responses. To depict the development of VAD, we first characterize its concepts and unique features, and then present its generic system architecture to illustrate the system workflow. Subsequently, several research challenges and representative works are investigated in detail, followed by a summary of authoritative benchmarks. We conclude this paper by putting forward some open issues and promising research trends for VAD, e.g., the cognitive mechanisms of human-machine dialogue under cross-modal dialogue context, and knowledge-enhanced cross-modal semantic interaction.