The goal of this paper is to synthesise talking faces with controllable facial motions. To achieve this goal, we propose two key ideas. The first is to establish a canonical space where every face has the same motion patterns but different identities. The second is to navigate a multimodal motion space that only represents motion-related features while eliminating identity information. To disentangle identity and motion, we introduce an orthogonality constraint between the two different latent spaces. From this, our method can generate natural-looking talking faces with fully controllable facial attributes and accurate lip synchronisation. Extensive experiments demonstrate that our method achieves state-of-the-art results in terms of both visual quality and lip-sync score. To the best of our knowledge, we are the first to develop a talking face generation framework that can accurately manifest full target facial motions including lip, head pose, and eye movements in the generated video without any additional supervision beyond RGB video with audio.
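The orthogonality constraint mentioned above can be realised as an auxiliary loss on paired latent codes from the identity and motion encoders. The following is a minimal sketch of one plausible formulation, not the paper's exact implementation; the function name, tensor shapes, and the assumption that both latent spaces share the same dimensionality are illustrative only.

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(identity_codes: torch.Tensor,
                       motion_codes: torch.Tensor) -> torch.Tensor:
    """Penalise alignment between identity and motion embeddings.

    identity_codes, motion_codes: (batch, dim) latent vectors produced by
    the two encoders (assumed here to share the same dimensionality).
    Both are L2-normalised so the loss measures cosine similarity; driving
    it towards zero encourages the two latent spaces to stay orthogonal,
    i.e. disentangled.
    """
    id_n = F.normalize(identity_codes, dim=-1)
    mo_n = F.normalize(motion_codes, dim=-1)
    # Mean squared cosine similarity over the batch.
    return (id_n * mo_n).sum(dim=-1).pow(2).mean()
```

In training, such a term would typically be added to the reconstruction and lip-sync objectives with a small weighting coefficient, so that disentanglement is encouraged without dominating the primary synthesis losses.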