The task of talking head generation is to synthesize a lip-synchronized talking head video from an arbitrary face image and an audio clip. Most existing methods ignore the local driving information of the mouth muscles. In this paper, we propose a novel recurrent generative network that uses both audio and speech-related facial action units (AUs) as the driving information. AU information related to the mouth can guide mouth movement more accurately. Since speech is highly correlated with speech-related AUs, we propose an Audio-to-AU module in our system to predict speech-related AU information from the audio. In addition, we use an AU classifier to ensure that the generated images contain correct AU information, and a frame discriminator for adversarial training to improve the realism of the generated faces. We verify the effectiveness of our model on the GRID and TCD-TIMIT datasets, and we conduct an ablation study to assess the contribution of each component. Quantitative and qualitative experiments demonstrate that our method outperforms existing methods in both image quality and lip-sync accuracy.
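To make the described pipeline concrete, the following is a minimal sketch (not the authors' code) of one plausible way to wire the components named above: an Audio-to-AU predictor and a recurrent generator driven jointly by audio and AU features. All module names, feature dimensions, and the single-frame interface are assumptions for illustration; the AU classifier and frame discriminator used as training-time critics are omitted for brevity.

```python
import torch
import torch.nn as nn

class AudioToAU(nn.Module):
    """Predicts speech-related AU activations from an audio feature vector (assumed interface)."""
    def __init__(self, audio_dim=256, num_aus=6):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 128), nn.ReLU(),
            nn.Linear(128, num_aus), nn.Sigmoid())  # AU intensities in [0, 1]

    def forward(self, audio_feat):
        return self.net(audio_feat)

class RecurrentGenerator(nn.Module):
    """A GRU cell carries temporal state; a decoder maps identity + driving codes to a frame."""
    def __init__(self, id_dim=256, audio_dim=256, num_aus=6, hidden=512):
        super().__init__()
        self.rnn = nn.GRUCell(id_dim + audio_dim + num_aus, hidden)
        self.decode = nn.Sequential(
            nn.Linear(hidden, 3 * 64 * 64), nn.Tanh())  # toy 64x64 RGB frame

    def forward(self, id_feat, audio_feat, au_feat, h):
        h = self.rnn(torch.cat([id_feat, audio_feat, au_feat], dim=-1), h)
        frame = self.decode(h).view(-1, 3, 64, 64)
        return frame, h

# One generation step for a batch of 2, with random stand-in features.
audio_to_au = AudioToAU()
generator = RecurrentGenerator()
id_feat = torch.randn(2, 256)        # from a face-image encoder (assumed)
audio_feat = torch.randn(2, 256)     # from an audio encoder (assumed)
au_feat = audio_to_au(audio_feat)    # speech-related AUs predicted from audio
h = torch.zeros(2, 512)              # recurrent state
frame, h = generator(id_feat, audio_feat, au_feat, h)
print(frame.shape)                   # torch.Size([2, 3, 64, 64])
```

In this sketch, the predicted AU vector is simply concatenated with the audio and identity features before the recurrent update; the actual fusion strategy in the paper may differ.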