Speech-driven 3D face animation aims to generate realistic facial expressions that match the speech content and emotion. However, existing methods often neglect emotional facial expressions or fail to disentangle them from the speech content. To address this issue, this paper proposes an end-to-end neural network that disentangles the different emotions in speech so as to generate rich 3D facial expressions. Specifically, we introduce an emotion disentangling encoder (EDE) that separates emotion and content in the speech by cross-reconstructing speech signals with different emotion labels. An emotion-guided feature fusion decoder is then employed to generate a 3D talking face with enhanced emotion. The decoder is driven by the disentangled identity, emotion, and content embeddings, yielding controllable personal and emotional styles. Finally, considering the scarcity of 3D emotional talking face data, we resort to the supervision of facial blendshapes, which enables the reconstruction of plausible 3D faces from 2D emotional data, and we contribute a large-scale 3D emotional talking face dataset (3D-ETF) to train the network. Our experiments and user studies demonstrate that our approach outperforms state-of-the-art methods and exhibits more diverse facial movements. We recommend watching the supplementary video: https://ziqiaopeng.github.io/emotalk
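As a rough illustration of the cross-reconstruction idea behind the emotion disentangling encoder, the sketch below pairs two utterances of the same content spoken with different emotions, swaps their emotion embeddings, and asks the decoder to reconstruct each input from the other's emotion. All module names, dimensions, and the simple L2 losses are our own assumptions for exposition, not the paper's actual implementation.

```python
# Minimal PyTorch sketch of cross-reconstruction for emotion/content disentanglement.
# Assumed, illustrative design: GRU encoders/decoder and MSE losses (not the paper's exact network).
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmotionDisentangler(nn.Module):
    def __init__(self, feat_dim=128, hidden_dim=256):
        super().__init__()
        # Two separate encoders: frame-wise content and utterance-level emotion.
        self.content_enc = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.emotion_enc = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Decoder maps concatenated (content, emotion) features back to speech features.
        self.decoder = nn.GRU(2 * hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, feat_dim)

    def encode(self, x):
        c, _ = self.content_enc(x)                       # (B, T, H) content embedding
        _, e = self.emotion_enc(x)                       # final hidden state as emotion code
        e = e[-1].unsqueeze(1).expand(-1, c.size(1), -1)  # broadcast emotion over time
        return c, e

    def decode(self, content, emotion):
        h, _ = self.decoder(torch.cat([content, emotion], dim=-1))
        return self.out(h)


def cross_reconstruction_loss(model, x_a, x_b):
    """x_a, x_b: speech features of the same content spoken with different emotions.

    Swapping emotion embeddings and reconstructing each utterance from the other's
    emotion forces the two embeddings to carry complementary information.
    """
    c_a, e_a = model.encode(x_a)
    c_b, e_b = model.encode(x_b)
    # Self-reconstruction: content and emotion taken from the same utterance.
    self_loss = F.mse_loss(model.decode(c_a, e_a), x_a) + \
                F.mse_loss(model.decode(c_b, e_b), x_b)
    # Cross-reconstruction: content from one utterance, emotion from the other.
    cross_loss = F.mse_loss(model.decode(c_a, e_b), x_b) + \
                 F.mse_loss(model.decode(c_b, e_a), x_a)
    return self_loss + cross_loss
```

In this toy setup, the emotion encoder can only help reconstruct the swapped target through whatever varies between the two utterances (the emotion), while the content encoder captures what they share, which is the intuition behind disentangling the two factors.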