Audio-driven one-shot talking face generation methods are usually trained on video resources of various persons. However, their generated videos often suffer from unnatural mouth shapes and out-of-sync lips because those methods struggle to learn a consistent speech style from different speakers. We observe that it is much easier to learn a consistent speech style from a specific speaker, which leads to authentic mouth movements. Hence, we propose a novel one-shot talking face generation framework that explores consistent correlations between audio and visual motions of a specific speaker and then transfers audio-driven motion fields to a reference image. Specifically, we develop an Audio-Visual Correlation Transformer (AVCT) that infers talking motions, represented by keypoint-based dense motion fields, from input audio. In particular, considering that audio may come from different identities in deployment, we incorporate phonemes to represent audio signals. In this manner, our AVCT can inherently generalize to audio spoken by other identities. Moreover, as face keypoints are used to represent speakers, AVCT is agnostic to the appearance of the training speaker, and thus allows us to readily manipulate face images of different identities. Considering that different face shapes lead to different motions, a motion field transfer module is employed to reduce the gap between the audio-driven dense motion fields of the training identity and those of the one-shot reference. Once the dense motion field of the reference image is obtained, we employ an image renderer to generate its talking face video from an audio clip. Thanks to the learned consistent speaking style, our method generates authentic mouth shapes and vivid movements. Extensive experiments demonstrate that our synthesized videos outperform the state-of-the-art in terms of visual quality and lip-sync.
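To make the described pipeline concrete, the following is a minimal PyTorch sketch of the data flow the abstract outlines: phoneme tokens are encoded by a transformer into keypoint-based motion displacements, the motion is adapted to a one-shot reference face, and a renderer warps the reference image. All module names, dimensions, the scaling-based transfer, and the warping-only renderer are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only; sizes and the transfer/rendering steps are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualCorrelationTransformer(nn.Module):
    """Maps a phoneme sequence to per-frame keypoint displacements (stand-in for AVCT)."""

    def __init__(self, n_phonemes=70, d_model=256, n_keypoints=10):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        # Each keypoint receives a 2-D displacement per audio frame.
        self.to_motion = nn.Linear(d_model, n_keypoints * 2)

    def forward(self, phoneme_ids):                  # (B, T) integer phoneme ids
        x = self.phoneme_emb(phoneme_ids)            # (B, T, d_model)
        x = self.encoder(x)                          # temporal audio-visual correlation
        disp = self.to_motion(x)                     # (B, T, n_keypoints * 2)
        return disp.view(phoneme_ids.shape[0], phoneme_ids.shape[1], -1, 2)


def transfer_motion(src_disp, src_kp, ref_kp):
    """Rescale source-speaker displacements by the ratio of face extents:
    a crude stand-in for the paper's motion field transfer module."""
    scale = (ref_kp.amax(-2, keepdim=True) - ref_kp.amin(-2, keepdim=True)) / \
            (src_kp.amax(-2, keepdim=True) - src_kp.amin(-2, keepdim=True) + 1e-6)
    return src_disp * scale[:, None]                 # broadcast over time


def render(reference_image, dense_flow):
    """Warp the reference image with a dense flow field; a real renderer would
    also handle occlusions and refine appearance."""
    B, _, H, W = reference_image.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing="ij")
    grid = torch.stack((xs, ys), -1).unsqueeze(0).expand(B, -1, -1, -1)
    return F.grid_sample(reference_image, grid + dense_flow, align_corners=True)


# Usage with random inputs: 25 audio frames drive a 256x256 reference image.
avct = AudioVisualCorrelationTransformer()
phonemes = torch.randint(0, 70, (1, 25))
disp = avct(phonemes)                                # (1, 25, 10, 2)
adapted = transfer_motion(disp, torch.rand(1, 10, 2), torch.rand(1, 10, 2))
frame0 = render(torch.rand(1, 3, 256, 256), torch.zeros(1, 256, 256, 2))
```

The sketch keeps identity out of the audio branch (phonemes, not raw waveforms) and out of the motion representation (keypoints, not pixels), which is what lets the abstract claim generalization to unseen voices and unseen reference faces.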