Generating talking head videos from a single face image and a piece of speech audio remains challenging: common failure modes include unnatural head movement, distorted expressions, and identity modification. We argue that these issues stem mainly from learning from coupled 2D motion fields. On the other hand, explicitly using 3D information also suffers from stiff expressions and incoherent video. We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face renderer for talking head generation. To learn realistic motion coefficients, we explicitly model the connections between audio and each type of motion coefficient individually. Specifically, we present ExpNet to learn accurate facial expressions from audio by distilling both coefficients and 3D-rendered faces. For head pose, we design PoseVAE, a conditional VAE, to synthesize head motion in different styles. Finally, the generated 3D motion coefficients are mapped to the unsupervised 3D keypoint space of the proposed face renderer, which synthesizes the final video. Extensive experiments demonstrate the superiority of our method in terms of motion and video quality.
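To make the described pipeline concrete, the sketch below shows one way the pieces could fit together: audio features drive expression coefficients (ExpNet) and head-pose coefficients (PoseVAE), which are then passed to a 3D-aware renderer. This is a minimal illustration only; the class and function signatures, the tiny MLP/VAE decoder internals, the coefficient dimensions, and the `renderer` callable are all assumptions for the sketch and do not reflect the authors' actual implementation.

```python
# Minimal sketch of an audio-to-coefficients-to-renderer pipeline,
# loosely following the structure described in the abstract.
# All names, dimensions, and signatures are hypothetical placeholders.
import torch
import torch.nn as nn


class ExpNet(nn.Module):
    """Predicts 3DMM expression coefficients from audio features (hypothetical interface)."""
    def __init__(self, audio_dim=512, exp_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(), nn.Linear(256, exp_dim)
        )

    def forward(self, audio_feat):              # (B, T, audio_dim)
        return self.net(audio_feat)             # (B, T, exp_dim) expression coefficients


class PoseVAE(nn.Module):
    """Conditional VAE decoder that samples stylized head-pose sequences from audio
    (hypothetical interface; only the sampling path is sketched)."""
    def __init__(self, audio_dim=512, pose_dim=6, latent_dim=64):
        super().__init__()
        self.latent_dim = latent_dim
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim + audio_dim, 256), nn.ReLU(), nn.Linear(256, pose_dim)
        )

    def sample(self, audio_feat):               # (B, T, audio_dim)
        # Draw a style latent per frame and condition the decoder on audio.
        z = torch.randn(*audio_feat.shape[:2], self.latent_dim, device=audio_feat.device)
        return self.decoder(torch.cat([z, audio_feat], dim=-1))  # (B, T, pose_dim)


def generate_talking_head(face_image, audio_feat, expnet, posevae, renderer):
    """Map audio to 3DMM motion coefficients, then drive a 3D-aware renderer.
    `renderer` is an external callable that maps (image, coefficients) to frames."""
    exp_coeffs = expnet(audio_feat)             # facial expression coefficients
    pose_coeffs = posevae.sample(audio_feat)    # head pose (rotation + translation)
    motion = torch.cat([exp_coeffs, pose_coeffs], dim=-1)
    # The renderer is assumed to map coefficients into its unsupervised 3D keypoint
    # space internally and to output the synthesized frame sequence.
    return renderer(face_image, motion)
```

The point of the sketch is the decoupling: because expression and head pose are produced by separate modules, each can be learned or sampled independently before the renderer combines them into the final video.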