Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data. Prior works typically focus on learning phoneme-level features of short audio windows with limited context, occasionally resulting in inaccurate lip movements. To tackle this limitation, we propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes. To cope with the data scarcity issue, we integrate self-supervised pre-trained speech representations. In addition, we devise two biased attention mechanisms well suited to this specific task: the biased cross-modal multi-head (MH) attention and the biased causal MH self-attention with a periodic positional encoding strategy. The former effectively aligns the audio and motion modalities, whereas the latter enables generalization to longer audio sequences. Extensive experiments and a perceptual user study show that our approach outperforms existing state-of-the-art methods. The code will be made available.
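To make the two devices named above more concrete, the following is a minimal PyTorch sketch of a periodic positional encoding and a biased causal attention mask. It is illustrative only: the period value (25), the function names, and the exact form of the distance bias are assumptions for this sketch, not the paper's definitive formulation.

```python
import math
import torch

def periodic_positional_encoding(seq_len: int, d_model: int, period: int = 25) -> torch.Tensor:
    """Sketch of a periodic positional encoding: the standard sinusoidal
    encoding is recycled with a fixed period so the model can handle
    sequences longer than those seen during training.
    `period` (25 here) is an assumed hyperparameter."""
    position = torch.arange(seq_len).float() % period          # repeat every `period` steps
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position.unsqueeze(1) * div_term)
    pe[:, 1::2] = torch.cos(position.unsqueeze(1) * div_term)
    return pe                                                   # (seq_len, d_model)

def biased_causal_mask(seq_len: int, period: int = 25) -> torch.Tensor:
    """Sketch of a biased causal attention mask: on top of the usual causal
    masking, each visible key receives a distance-dependent penalty
    (here, proportional to how many periods separate query and key),
    nudging attention toward recent motion frames. The exact bias used in
    FaceFormer may differ; this only illustrates the idea."""
    i = torch.arange(seq_len).unsqueeze(1)                      # query index
    j = torch.arange(seq_len).unsqueeze(0)                      # key index
    bias = -torch.div(i - j, period, rounding_mode='floor').float()
    bias = bias.masked_fill(j > i, float('-inf'))               # causal: block future keys
    return bias                                                  # added to attention logits
```

In this sketch, the mask returned by `biased_causal_mask` would be added to the pre-softmax attention scores of the decoder's self-attention, while the periodic encoding is added to the motion-token embeddings; the biased cross-modal attention for audio-motion alignment is not shown.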