Most existing audio-driven 3D facial animation methods suffer from a lack of detailed facial expressions and head poses, resulting in an unsatisfactory human-robot interaction experience. In this paper, a novel pose-controllable 3D facial animation synthesis method is proposed that utilizes hierarchical audio-vertex attention. To synthesize realistic and detailed expressions, a hierarchical decomposition strategy is proposed that encodes the audio signal into both a global latent feature and local vertex-wise control features. The local and global audio features, combined with vertex spatial features, are then used to predict the final consistent facial animation via a graph convolutional neural network, which fuses the intrinsic spatial topology of the face model with the corresponding semantic features of the audio. To achieve pose-controllable animation, we introduce a novel pose attribute augmentation method that leverages a 2D talking-face technique. Experimental results indicate that the proposed method produces more realistic facial expressions and head pose movements. Qualitative and quantitative experiments show that the proposed method achieves competitive performance against state-of-the-art methods.
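To make the described pipeline concrete, the following is a minimal PyTorch sketch of the two components named above: a hierarchical audio encoder that yields a global latent feature plus vertex-wise control features via cross-attention, and a graph-convolutional decoder that fuses them with vertex spatial features over the mesh topology. This is not the authors' implementation; all module names, feature dimensions, vertex counts, and the adjacency handling are illustrative assumptions.

```python
# Illustrative sketch only; architecture details are assumptions, not the paper's code.
import torch
import torch.nn as nn

class HierarchicalAudioEncoder(nn.Module):
    """Encodes an audio feature sequence into a global latent feature and
    local vertex-wise control features via audio-vertex cross-attention."""
    def __init__(self, audio_dim=128, feat_dim=64, n_vertices=5023):
        super().__init__()
        self.proj = nn.Linear(audio_dim, feat_dim)
        # One learnable query per mesh vertex attends over the audio frames.
        self.vertex_queries = nn.Parameter(torch.randn(n_vertices, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=4,
                                                batch_first=True)

    def forward(self, audio):                    # audio: (B, T, audio_dim)
        h = self.proj(audio)                     # (B, T, feat_dim)
        global_feat = h.mean(dim=1)              # (B, feat_dim) global latent
        q = self.vertex_queries.unsqueeze(0).expand(h.size(0), -1, -1)
        local_feat, _ = self.cross_attn(q, h, h) # (B, V, feat_dim) vertex-wise
        return global_feat, local_feat

class GraphConvDecoder(nn.Module):
    """Fuses vertex spatial features with the audio features through a simple
    graph convolution on the face mesh and predicts per-vertex offsets."""
    def __init__(self, feat_dim=64, hidden=128):
        super().__init__()
        self.fc_in = nn.Linear(3 + 2 * feat_dim, hidden)  # xyz + local + global
        self.gc = nn.Linear(hidden, hidden)
        self.fc_out = nn.Linear(hidden, 3)       # per-vertex displacement

    def forward(self, verts, global_feat, local_feat, adj_norm):
        # verts: (B, V, 3); adj_norm: (V, V) row-normalized mesh adjacency.
        g = global_feat.unsqueeze(1).expand(-1, verts.size(1), -1)
        x = torch.relu(self.fc_in(torch.cat([verts, local_feat, g], dim=-1)))
        x = torch.relu(self.gc(adj_norm @ x))    # propagate along mesh edges
        return verts + self.fc_out(x)            # animated vertex positions
```

In this sketch, the per-vertex attention queries stand in for the local audio-vertex attention, while the row-normalized adjacency matrix carries the face model's spatial topology into the graph convolution; the pose attribute augmentation stage is omitted, as it operates on the training data rather than the network.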