We propose the first approach to automatically and jointly synthesize the synchronous 3D conversational body and hand gestures, as well as the 3D face and head animation, of a virtual character from speech input. Our algorithm uses a CNN architecture that leverages the inherent correlation between facial expression and hand gestures. Synthesis of conversational body gestures is a multi-modal problem, since many different gestures can plausibly accompany the same input speech. To synthesize plausible body gestures in this setting, we train a Generative Adversarial Network (GAN) based model that measures the plausibility of the generated sequences of 3D body motion when paired with the input audio features. We also contribute a new way to create a large corpus of more than 33 hours of annotated body, hand, and face data from in-the-wild videos of talking people. To this end, we apply state-of-the-art monocular approaches for 3D body and hand pose estimation, as well as dense 3D face performance capture, to the video corpus. In this way, we can train on orders of magnitude more data than previous algorithms that resort to complex in-studio motion capture solutions, and thereby train more expressive synthesis algorithms. Our experiments and user study demonstrate the state-of-the-art quality of our speech-synthesized full 3D character animations.
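To make the GAN component concrete, the sketch below illustrates one plausible form of the idea described above: a discriminator that scores a 3D body-motion sequence jointly with the accompanying audio features, so that plausibility is measured for the motion-audio pair rather than the motion alone. This is a minimal illustrative sketch, not the authors' implementation; all layer sizes, feature dimensions, and names are hypothetical.

```python
# Minimal sketch of an audio-conditioned motion discriminator (hypothetical
# architecture; the paper's actual network details are not reproduced here).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionAudioDiscriminator(nn.Module):
    """Scores how plausibly a 3D pose sequence accompanies the input audio."""

    def __init__(self, audio_dim=64, pose_dim=63, hidden=256):
        super().__init__()
        # 1D convolutions over time, applied to the concatenation of
        # per-frame audio features and 3D body pose parameters.
        self.net = nn.Sequential(
            nn.Conv1d(audio_dim + pose_dim, hidden, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, hidden, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(hidden, 1, kernel_size=1),  # per-window plausibility logits
        )

    def forward(self, audio, poses):
        # audio: (B, T, audio_dim), poses: (B, T, pose_dim)
        x = torch.cat([audio, poses], dim=-1).transpose(1, 2)  # -> (B, C, T)
        return self.net(x).mean(dim=-1)  # average logits over time -> (B, 1)


# Toy usage: real vs. generated motion paired with the same audio clip.
disc = MotionAudioDiscriminator()
audio = torch.randn(8, 120, 64)       # 8 clips, 120 frames of audio features
real_poses = torch.randn(8, 120, 63)  # e.g. 21 joints x 3 pose parameters
fake_poses = torch.randn(8, 120, 63)  # output of the speech-to-motion generator
loss = F.binary_cross_entropy_with_logits(
    disc(audio, real_poses), torch.ones(8, 1)
) + F.binary_cross_entropy_with_logits(
    disc(audio, fake_poses), torch.zeros(8, 1)
)
```

Because the discriminator sees the audio and the motion together, a generator trained against it is penalized not only for implausible motion but also for motion that is plausible in isolation yet mismatched to the speech, which is the key property the abstract attributes to the GAN-based model.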