In this paper, we present a video-based learning framework for animating personalized 3D talking faces from audio. We introduce two training-time data normalizations that significantly improve data sample efficiency. First, we isolate and represent faces in a normalized space that decouples 3D geometry, head pose, and texture. This decomposes the prediction problem into regressions over the 3D face shape and the corresponding 2D texture atlas. Second, we leverage facial symmetry and the approximate albedo constancy of skin to isolate and remove spatio-temporal lighting variations. Together, these normalizations allow simple networks to generate high-fidelity lip-sync videos under novel ambient illumination while training with just a single speaker-specific video. Further, to stabilize temporal dynamics, we introduce an auto-regressive approach that conditions the model on its previous visual state. Human ratings and objective metrics demonstrate that our method outperforms contemporary state-of-the-art audio-driven video reenactment baselines in terms of realism, lip-sync, and visual quality scores. We illustrate several applications enabled by our framework.
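To make the decomposition concrete, the sketch below illustrates one plausible way to regress normalized 3D face geometry and a 2D texture atlas from audio while conditioning auto-regressively on the previous visual state, as the abstract describes. This is not the authors' implementation: the PyTorch framework choice, all module names, layer sizes, and the form of the "previous visual state" vector are illustrative assumptions.

```python
# Minimal sketch (assumptions throughout, not the paper's code): an audio encoder
# with two regression heads -- one for normalized 3D geometry, one for a 2D texture
# atlas -- fused with the previous visual state for auto-regressive conditioning.
import torch
import torch.nn as nn


class AudioToFaceSketch(nn.Module):
    def __init__(self, audio_dim=128, state_dim=256, n_vertices=5023, atlas_hw=(256, 256)):
        super().__init__()
        # Encode a short window of audio features into a latent state.
        self.encoder = nn.GRU(audio_dim, state_dim, batch_first=True)
        # Fuse the audio latent with the previous visual state (auto-regressive conditioning).
        self.state_fuse = nn.Linear(2 * state_dim, state_dim)
        # Separate regression heads: 3D vertex positions and a 2D texture atlas.
        self.geometry_head = nn.Linear(state_dim, n_vertices * 3)
        self.texture_head = nn.Sequential(
            nn.Linear(state_dim, 64 * 8 * 8),
            nn.Unflatten(1, (64, 8, 8)),
            nn.Upsample(size=atlas_hw, mode="bilinear", align_corners=False),
            nn.Conv2d(64, 3, kernel_size=3, padding=1),
        )

    def forward(self, audio_feats, prev_state):
        # audio_feats: (B, T, audio_dim); prev_state: (B, state_dim)
        _, h = self.encoder(audio_feats)                  # h: (1, B, state_dim)
        h = h.squeeze(0)
        fused = torch.tanh(self.state_fuse(torch.cat([h, prev_state], dim=-1)))
        geometry = self.geometry_head(fused)              # (B, n_vertices * 3)
        atlas = torch.sigmoid(self.texture_head(fused))   # (B, 3, H, W) in [0, 1]
        return geometry, atlas, fused                     # fused serves as the next prev_state
```

In use, the model would be unrolled over a video: the `fused` output at each step is passed back in as `prev_state` for the next step, which is one simple way to realize the temporal stabilization the abstract attributes to auto-regressive conditioning.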