Talking head generation aims to synthesize a lip-synchronized talking head video from an arbitrary face image and corresponding audio clips. Existing methods neglect not only the interaction and relationship between cross-modal information, but also the local driving information of the mouth muscles. In this study, we propose a novel generative framework that contains a dilated non-causal temporal convolutional self-attention network as a multimodal fusion module to promote relationship learning of cross-modal features. In addition, our proposed method uses both audio and speech-related facial action units (AUs) as driving information. Speech-related AU information can guide mouth movements more accurately. Because speech is highly correlated with speech-related AUs, we propose an audio-to-AU module to predict speech-related AU information from audio. We also utilize a pre-trained AU classifier to ensure that the generated images contain correct AU information. We verify the effectiveness of the proposed model on the GRID and TCD-TIMIT datasets, and conduct an ablation study to assess the contribution of each component. The results of quantitative and qualitative experiments demonstrate that our method outperforms existing methods in both image quality and lip-sync accuracy.
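To make the described fusion module concrete, the sketch below shows one possible reading of a dilated non-causal temporal convolutional network combined with self-attention for fusing per-frame audio and speech-related AU features. It is a minimal illustration only: the layer widths, dilation schedule, feature dimensions, and the concatenation-based fusion strategy are assumptions, not the authors' implementation.

```python
# Illustrative sketch (PyTorch). Module names, dimensions, and the fusion
# strategy are assumptions made for clarity, not the paper's released code.
import torch
import torch.nn as nn


class DilatedNonCausalTCNBlock(nn.Module):
    """1-D temporal convolution with symmetric (non-causal) padding and dilation."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        # Centered padding keeps the sequence length and lets each frame
        # attend to both past and future context (non-causal).
        padding = (kernel_size - 1) // 2 * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=padding, dilation=dilation)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T)
        return self.act(self.norm(self.conv(x))) + x  # residual connection


class MultimodalFusion(nn.Module):
    """Fuses audio and AU feature sequences with dilated TCN blocks + self-attention."""

    def __init__(self, audio_dim: int = 256, au_dim: int = 64,
                 hidden: int = 256, num_heads: int = 4, num_blocks: int = 3):
        super().__init__()
        self.proj = nn.Linear(audio_dim + au_dim, hidden)
        self.tcn = nn.Sequential(*[
            DilatedNonCausalTCNBlock(hidden, dilation=2 ** i)
            for i in range(num_blocks)  # exponentially growing receptive field
        ])
        self.attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)

    def forward(self, audio_feat: torch.Tensor, au_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (B, T, audio_dim), au_feat: (B, T, au_dim)
        x = self.proj(torch.cat([audio_feat, au_feat], dim=-1))  # (B, T, hidden)
        x = self.tcn(x.transpose(1, 2)).transpose(1, 2)          # temporal context
        fused, _ = self.attn(x, x, x)                            # cross-frame self-attention
        return fused                                             # (B, T, hidden)


if __name__ == "__main__":
    fusion = MultimodalFusion()
    audio = torch.randn(2, 25, 256)   # 25 frames of audio features
    aus = torch.randn(2, 25, 64)      # predicted speech-related AU features
    print(fusion(audio, aus).shape)   # torch.Size([2, 25, 256])
```

In this reading, the audio-to-AU module would supply the `aus` sequence, and the fused representation would condition the image generator; the pre-trained AU classifier would then act as a loss on the generated frames.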