We present a new listening head generation benchmark for synthesizing the responsive feedback of a listener (e.g., nods, smiles) during a face-to-face conversation. As the indispensable complement to talking head generation, listening head generation has seldom been studied in the literature. Automatically synthesizing listening behavior that actively responds to a talking head is critical to applications such as digital humans, virtual agents, and social robots. In this work, we propose a novel dataset, "ViCo", highlighting listening head generation during face-to-face conversations. A total of 92 identities (67 speakers and 76 listeners) are involved in ViCo, featuring 483 clips in a paired "speaking-listening" pattern, where listeners show three listening styles based on their attitudes: positive, neutral, and negative. Different from traditional speech-to-gesture or talking-head generation, listening head generation takes as input both the audio and visual signals from the speaker and gives non-verbal feedback (e.g., head motions, facial expressions) in real time. Our dataset supports a wide range of applications such as human-to-human interaction, video-to-video translation, and cross-modal understanding and generation. To encourage further research, we also release a listening head generation baseline conditioned on different listening attitudes. Code & ViCo dataset: https://project.mhzhou.com/vico.
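To make the task interface concrete, below is a minimal sketch of a listening head generator as described above: it consumes per-frame speaker audio and visual features together with one of the three attitude labels, and emits listener motion coefficients frame by frame. This is not the released baseline; the module names, feature dimensions, the LSTM backbone, and the 3DMM-style output layout are all illustrative assumptions.

```python
# A minimal sketch of the listening head generation interface, NOT the
# authors' implementation. Dimensions, modules, and the output layout
# are illustrative assumptions.
import torch
import torch.nn as nn

ATTITUDES = {"positive": 0, "neutral": 1, "negative": 2}  # the three listening styles

class ListeningHeadBaseline(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=128, attitude_dim=16,
                 hidden_dim=256, out_dim=70):
        # out_dim=70 assumes a 3DMM-style output (e.g., 64 expression
        # coefficients + 6 head pose parameters); the real layout depends
        # on the face model used.
        super().__init__()
        self.attitude_emb = nn.Embedding(len(ATTITUDES), attitude_dim)
        self.rnn = nn.LSTM(audio_dim + visual_dim + attitude_dim,
                           hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, out_dim)

    def forward(self, speaker_audio, speaker_visual, attitude_id):
        # speaker_audio:  (B, T, audio_dim)  per-frame speaker audio features
        # speaker_visual: (B, T, visual_dim) per-frame speaker face features
        # attitude_id:    (B,) listening-style index for each clip
        att = self.attitude_emb(attitude_id)                     # (B, attitude_dim)
        att = att.unsqueeze(1).expand(-1, speaker_audio.size(1), -1)
        x = torch.cat([speaker_audio, speaker_visual, att], dim=-1)
        h, _ = self.rnn(x)   # causal recurrence, so frames can be emitted in real time
        return self.head(h)  # (B, T, out_dim) listener motion per frame

# Usage: two 100-frame clips, one "positive" and one "negative" listener.
model = ListeningHeadBaseline()
audio = torch.randn(2, 100, 128)
visual = torch.randn(2, 100, 128)
attitude = torch.tensor([ATTITUDES["positive"], ATTITUDES["negative"]])
motion = model(audio, visual, attitude)  # torch.Size([2, 100, 70])
```

The causal recurrence reflects the real-time requirement stated above: each output frame depends only on speaker signals seen so far, so the listener's feedback can be rendered while the speaker is still talking.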