A well-designed interactive human-like dialogue system is expected to take actions (e.g., smiling) and respond in a pattern similar to humans. However, due to the single-modality (speech-only) limitation or the small volume of currently available public datasets, most dialogue systems can only respond in speech and cannot take human-like actions. In this work, we build a large-scale multi-modal dataset of face-to-face human-to-human conversation with fine-grained annotations. The raw data, in video format, contains 635 dialogue sessions collected from 200 participants on designed topics, lasting 52 hours in total. Moreover, we manually annotated the start/end timestamps of verbal and non-verbal behaviors in each dialogue session. Furthermore, we developed a corresponding evaluation tool for human-like dialogue systems that automatically evaluates the accuracy of two basic tasks, turn-taking prediction and backchannel prediction, in terms of both timing and content. We have released the data; the tools will be released at the conference.