This paper presents a generic method for generating full facial 3D animation from speech. Existing approaches to audio-driven facial animation exhibit uncanny or static upper-face animation, fail to produce accurate and plausible co-articulation, or rely on person-specific models that limit their scalability. To improve upon existing models, we propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis for the entire face. At the core of our approach is a categorical latent space for facial animation that disentangles audio-correlated and audio-uncorrelated information based on a novel cross-modality loss. Our approach ensures highly accurate lip motion while also synthesizing plausible animation of the parts of the face that are uncorrelated to the audio signal, such as eye blinks and eyebrow motion. We demonstrate that our approach outperforms several baselines and achieves state-of-the-art quality both qualitatively and quantitatively. A perceptual user study shows that our approach is deemed more realistic than the current state of the art in over 75% of cases. We recommend watching the supplemental video before reading the paper: https://research.fb.com/wp-content/uploads/2021/04/mesh_talk.mp4
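The abstract names the core mechanism but gives no implementation detail; purely as an illustration, the sketch below shows one plausible reading of a categorical latent space combined with a cross-modality disentangling loss, in PyTorch. Every name, dimension, mask, and the exact mismatch pairing here are assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CategoricalLatentEncoder(nn.Module):
    """Fuses per-frame audio and expression features into a categorical
    latent code (hypothetical layer sizes and head/class counts)."""
    def __init__(self, audio_dim=128, expr_dim=128, n_heads=64, n_classes=128):
        super().__init__()
        self.fuse = nn.Linear(audio_dim + expr_dim, n_heads * n_classes)
        self.n_heads, self.n_classes = n_heads, n_classes

    def forward(self, audio_feat, expr_feat):
        # Logits over n_classes categories for each of n_heads latent heads.
        logits = self.fuse(torch.cat([audio_feat, expr_feat], dim=-1))
        logits = logits.view(*logits.shape[:-1], self.n_heads, self.n_classes)
        # Straight-through Gumbel-softmax: a differentiable categorical sample.
        return F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)

def cross_modality_loss(decode, audio, expr, audio_other, expr_other,
                        target, mouth_mask, upper_mask):
    """One way to force disentanglement: the mouth region must still be
    reconstructed correctly when the expression input is swapped for an
    unrelated sequence (so lip motion can only come from audio), and the
    upper face must still be correct when the audio input is swapped
    (so blinks and eyebrows can only come from expression). `decode` and
    the per-vertex masks are assumed helpers, not the paper's API."""
    recon_a = decode(audio, expr_other)   # true audio, mismatched expression
    recon_e = decode(audio_other, expr)   # mismatched audio, true expression
    loss_mouth = ((recon_a - target) ** 2 * mouth_mask).mean()
    loss_upper = ((recon_e - target) ** 2 * upper_mask).mean()
    return loss_mouth + loss_upper
```

A discrete latent space also makes it natural to sample audio-uncorrelated motion such as blinks at inference time, for example with an autoregressive prior over the codes; the abstract does not spell this out, so treat that reading as a motivated guess.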