Building computer systems that can converse about their visual environment is one of the oldest concerns of research in Artificial Intelligence and Computational Linguistics (see, for example, Winograd's 1972 SHRDLU system). Only recently, however, have methods from computer vision and natural language processing become powerful enough to make this goal seem attainable. Driven especially by developments in computer vision, many datasets and collection environments have recently been published that bring together verbal interaction and visual processing. Here, we argue that these datasets tend to oversimplify the dialogue part, and we propose a task, MeetUp!, that requires both visual and conversational grounding and that makes stronger demands on representations of the discourse. MeetUp! is a two-player coordination game in which players move through a visual environment with the objective of finding each other. To do so, they must talk about what they see and achieve mutual understanding. We describe a data collection and show that the resulting dialogues indeed exhibit the dialogue phenomena of interest, while also challenging the language and vision aspect.