This work addresses the problem of generating 3D holistic body motions from human speech. Given a speech recording, we synthesize sequences of 3D body poses, hand gestures, and facial expressions that are realistic and diverse. To achieve this, we first build a high-quality dataset of 3D holistic body meshes with synchronous speech. We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately. The separated modeling stems from the fact that face articulation strongly correlates with human speech, while body poses and hand gestures are less correlated. Specifically, we employ an autoencoder for face motions, and a compositional vector-quantized variational autoencoder (VQ-VAE) for the body and hand motions. The compositional VQ-VAE is key to generating diverse results. Additionally, we propose a cross-conditional autoregressive model that generates body poses and hand gestures, leading to coherent and realistic motions. Extensive experiments and user studies demonstrate that our proposed approach achieves state-of-the-art performance both qualitatively and quantitatively. Our novel dataset and code will be released for research purposes at https://talkshow.is.tue.mpg.de.
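To make the abstract's central component concrete, below is a minimal sketch of the vector-quantization step at the heart of a VQ-VAE, with separate codebooks for body and hands to suggest the "compositional" design. All names, dimensions, and hyperparameters here are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Minimal VQ layer: maps each latent vector to its nearest codebook entry."""
    def __init__(self, num_codes=256, code_dim=64):  # sizes are illustrative
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):                      # z: (batch, time, code_dim)
        flat = z.reshape(-1, z.shape[-1])      # (batch*time, code_dim)
        # Squared distance from each latent to every codebook entry.
        d = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ self.codebook.weight.t()
             + self.codebook.weight.pow(2).sum(1))
        idx = d.argmin(dim=1)                  # nearest code per latent
        q = self.codebook(idx).view_as(z)      # quantized latents
        # Straight-through estimator: gradients bypass the non-differentiable
        # argmin and flow back to the encoder.
        return z + (q - z).detach(), idx.view(z.shape[:-1])

# "Compositional" here means body and hands get independent codebooks, so
# their discrete pose tokens can be recombined freely at generation time.
body_vq = VectorQuantizer()
hand_vq = VectorQuantizer()
```

Keeping separate codebooks per body part enlarges the space of expressible pose combinations, which is one plausible reading of why the compositional design helps generate diverse results.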