Automatic synthesis of realistic co-speech gestures is an increasingly important yet challenging task in creating artificial embodied agents. Previous systems mainly focus on generating gestures in an end-to-end manner, which makes it difficult to capture clear rhythm and semantics due to the complex yet subtle harmony between speech and gestures. We present a novel co-speech gesture synthesis method that achieves convincing results on both rhythm and semantics. For rhythm, our system contains a robust rhythm-based segmentation pipeline that explicitly ensures temporal coherence between the vocalization and the gestures. For gesture semantics, we devise a mechanism that, grounded in linguistic theory, effectively disentangles low- and high-level neural embeddings of speech and motion. The high-level embedding corresponds to semantics, while the low-level embedding relates to subtle variations. Finally, we build a correspondence between the hierarchical embeddings of speech and motion, resulting in rhythm- and semantics-aware gesture synthesis. Evaluations with existing objective metrics, a newly proposed rhythmic metric, and human feedback show that our method outperforms state-of-the-art systems by a clear margin.
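To make the rhythm-based segmentation idea concrete, the sketch below shows one plausible way to split speech audio into rhythmic segments using onset detection, so that each gesture clip can be generated or retimed per segment. This is only an illustration under assumed tooling (librosa as a stand-in rhythm analyzer; the function name `rhythm_segments` and the file `speech.wav` are hypothetical), not the paper's actual pipeline.

```python
# Illustrative sketch only: segment speech audio at detected onsets so gesture
# clips can be aligned per segment. librosa's onset detector stands in for the
# paper's (unspecified) rhythm analysis.
import librosa

def rhythm_segments(audio_path: str):
    """Return (start, end) times, in seconds, of rhythm-based speech segments."""
    y, sr = librosa.load(audio_path, sr=None)
    # Onset times approximate rhythmic events in the vocalization.
    onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
    duration = librosa.get_duration(y=y, sr=sr)
    bounds = [0.0, *onsets, duration]
    return [(s, e) for s, e in zip(bounds, bounds[1:]) if e > s]

# Hypothetical usage: each returned segment would anchor one gesture clip,
# enforcing temporal coherence between vocalization and motion explicitly.
segments = rhythm_segments("speech.wav")
```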