Speech-driven 3D facial animation has been widely studied, yet there remains a gap in achieving realism and vividness due to the highly ill-posed nature of the problem and the scarcity of audio-visual data. Existing works typically formulate the cross-modal mapping as a regression task, which suffers from the regression-to-mean problem and leads to over-smoothed facial motions. In this paper, we propose to cast speech-driven facial animation as a code query task in the finite proxy space of a learned codebook, which effectively promotes the vividness of the generated motions by reducing the uncertainty of the cross-modal mapping. The codebook is learned by self-reconstruction over real facial motions and is thus embedded with realistic facial motion priors. Over this discrete motion space, a temporal autoregressive model is employed to sequentially synthesize facial motions from the input speech signal, which guarantees lip-sync as well as plausible facial expressions. We demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. A user study further confirms its superiority in perceptual quality.
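To make the "code query in a finite proxy space" idea concrete, below is a minimal sketch of the quantization step: a continuous per-frame motion feature is replaced by the nearest entry in a learned codebook, so the speech-to-motion model only selects discrete codes rather than regressing raw motions. This is an illustrative sketch, not the paper's implementation; names such as `codebook_size`, `feat_dim`, and `query_codebook` are assumptions.

```python
# Illustrative sketch of the code-query (vector quantization) step, assuming a
# codebook already learned by self-reconstruction over real facial motions.
import numpy as np

rng = np.random.default_rng(0)

codebook_size, feat_dim = 256, 64                       # K codes, each a d-dim motion feature
codebook = rng.normal(size=(codebook_size, feat_dim))   # stand-in for the learned codebook


def query_codebook(features: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Map each continuous feature (T, d) to the index and value of its nearest code."""
    # Squared Euclidean distance between every frame feature and every codebook entry.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    indices = dists.argmin(axis=1)                                        # discrete code per frame
    return indices, codebook[indices]                                     # quantized motion features


# Toy usage: quantize a sequence of T frame-wise motion features predicted from speech.
T = 30
continuous_motion = rng.normal(size=(T, feat_dim))
codes, quantized_motion = query_codebook(continuous_motion)
print(codes.shape, quantized_motion.shape)  # (30,) (30, 64)
```

In the full pipeline described above, the temporal autoregressive model would predict these discrete code indices frame by frame from the speech features, and the codebook entries would then be decoded back into facial motions.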