We propose a novel approach for few-shot talking-head synthesis. While recent work on neural talking heads has produced promising results, the generated images can still fail to preserve the identity of the subject shown in the source images. We posit this is a result of the entangled representation of each subject in a single latent code that models 3D shape information, identity cues, colors, lighting, and even background details. In contrast, we propose to factorize the representation of a subject into its spatial and style components. Our method generates a target frame in two steps. First, it predicts a dense spatial layout for the target image. Second, an image generator uses the predicted layout for spatial denormalization and synthesizes the target frame. We experimentally show that this disentangled representation leads to a significant improvement over previous methods, both quantitatively and qualitatively.
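To make the second step concrete, below is a minimal sketch of a spatially-adaptive denormalization layer of the kind the abstract alludes to, written in PyTorch. The class name SpatialDenorm, the layer sizes, and the hidden width are illustrative assumptions, not the authors' implementation: the point is only that per-pixel scale and shift maps predicted from the dense layout modulate the normalized generator features, rather than a single latent code carrying all of that information.

```python
# Sketch of spatially-adaptive denormalization driven by a predicted dense layout.
# Names, channel counts, and the hidden width are hypothetical, for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatialDenorm(nn.Module):
    def __init__(self, feature_channels: int, layout_channels: int, hidden: int = 128):
        super().__init__()
        # Parameter-free normalization of the generator feature map.
        self.norm = nn.InstanceNorm2d(feature_channels, affine=False)
        # Small conv net mapping the predicted layout to per-pixel modulation maps.
        self.shared = nn.Sequential(
            nn.Conv2d(layout_channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.to_gamma = nn.Conv2d(hidden, feature_channels, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(hidden, feature_channels, kernel_size=3, padding=1)

    def forward(self, features: torch.Tensor, layout: torch.Tensor) -> torch.Tensor:
        # Resize the layout to the spatial resolution of the current feature map.
        layout = F.interpolate(layout, size=features.shape[2:], mode="nearest")
        ctx = self.shared(layout)
        gamma = self.to_gamma(ctx)
        beta = self.to_beta(ctx)
        # Denormalize with spatially varying scale and shift predicted from the layout.
        return self.norm(features) * (1.0 + gamma) + beta


# Illustrative usage: 64-channel generator features modulated by a 16-channel layout.
feats = torch.randn(1, 64, 32, 32)
layout = torch.randn(1, 16, 64, 64)
out = SpatialDenorm(feature_channels=64, layout_channels=16)(feats, layout)
print(out.shape)  # torch.Size([1, 64, 32, 32])
```

In this sketch the style information stays in the generator's features while all spatial structure enters through the layout, which is one way to realize the spatial/style factorization described above.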