Speech-driven 3D facial animation has been widely studied, yet there remains a gap to achieving realism and vividness due to the highly ill-posed nature of the problem and the scarcity of audio-visual data. Existing works typically formulate the cross-modal mapping as a regression task, which suffers from the regression-to-mean problem and leads to over-smoothed facial motions. In this paper, we propose to cast speech-driven facial animation as a code query task in a finite proxy space of the learned codebook, which effectively promotes the vividness of the generated motions by reducing the cross-modal mapping uncertainty. The codebook is learned by self-reconstruction over real facial motions and is thus embedded with realistic facial motion priors. Over the discrete motion space, a temporal autoregressive model is employed to sequentially synthesize facial motions from the input speech signal, which guarantees lip-sync as well as plausible facial expressions. We demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. A user study further confirms our superiority in perceptual quality.
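To make the code-query formulation concrete, the following is a minimal sketch (not the authors' implementation): a VQ-style codebook learned by self-reconstruction defines a finite discrete motion space, and a temporal autoregressive model then queries code indices conditioned on speech features. All module names, layer choices (e.g., a GRU instead of the paper's architecture), and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MotionCodebook(nn.Module):
    """Discrete motion prior: a finite set of codes learned by self-reconstruction."""
    def __init__(self, num_codes=256, code_dim=64):
        super().__init__()
        self.codes = nn.Embedding(num_codes, code_dim)

    def quantize(self, z):                          # z: (T, code_dim) encoded motion features
        # nearest-neighbour lookup over codebook entries (used during self-reconstruction)
        dists = torch.cdist(z, self.codes.weight)   # (T, num_codes)
        idx = dists.argmin(dim=-1)                  # (T,)
        return self.codes(idx), idx

class SpeechToCode(nn.Module):
    """Temporal autoregressive model: map speech features to a sequence of code queries."""
    def __init__(self, audio_dim=128, hidden=256, num_codes=256, code_dim=64):
        super().__init__()
        self.codebook = MotionCodebook(num_codes, code_dim)
        self.rnn = nn.GRU(audio_dim + code_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_codes)

    def forward(self, audio_feats):                 # audio_feats: (1, T, audio_dim)
        T = audio_feats.size(1)
        prev = torch.zeros(1, self.codebook.codes.embedding_dim)
        h, motions = None, []
        for t in range(T):
            # condition on the current speech frame and the previously queried code
            x = torch.cat([audio_feats[:, t], prev], dim=-1).unsqueeze(1)
            out, h = self.rnn(x, h)
            logits = self.head(out[:, 0])           # (1, num_codes)
            idx = logits.argmax(dim=-1)             # greedy code query in the finite proxy space
            prev = self.codebook.codes(idx)         # quantized motion feature for this frame
            motions.append(prev)
        # (1, T, code_dim); a motion decoder would map these codes back to facial vertices
        return torch.stack(motions, dim=1)
```

Restricting the output to codebook entries is what reduces the cross-modal mapping uncertainty: the model selects among realistic motion modes rather than regressing to their mean.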