In contrast to the traditional avatar creation pipeline, which is costly, contemporary generative approaches directly learn the data distribution from photographs. While many works extend unconditional generative models and achieve some level of controllability, it remains challenging to ensure multi-view consistency, especially under large poses. In this work, we propose a network that generates 3D-aware portraits while being controllable via semantic parameters for pose, identity, expression, and illumination. Our network uses a neural scene representation to model 3D-aware portraits, whose generation is guided by a parametric face model that supports explicit control. While latent disentanglement can be further enhanced by contrasting images with partially different attributes, noticeable inconsistency still appears in non-face areas, e.g., hair and background, when animating expressions. We solve this by proposing a volume blending strategy in which we form a composite output by blending dynamic and static areas, with the two parts segmented by a jointly learned semantic field. Our method outperforms prior art in extensive experiments, producing realistic portraits with vivid expressions under natural lighting when viewed from free viewpoints. It also generalizes to real images as well as out-of-domain data, showing promise for real applications.
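As a rough illustration of the volume blending idea described above, the following minimal sketch blends a dynamic (expression-driven) branch and a static branch per ray sample using a soft mask derived from a semantic field, then volume-renders the composite. All names (`sigma_dyn`, `face_prob`, etc.) and the exact blending form are our assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def blend_and_render(sigma_dyn, rgb_dyn, sigma_sta, rgb_sta, face_prob, deltas):
    """Hypothetical volume blending along one ray (illustrative only).

    sigma_dyn, sigma_sta : (N,)   densities from the dynamic (face) and static branches
    rgb_dyn,   rgb_sta   : (N, 3) colors from the two branches
    face_prob            : (N,)   semantic-field probability that a sample is dynamic
    deltas               : (N,)   distances between consecutive samples along the ray
    """
    # Blend the two branches per sample, weighted by the semantic mask.
    sigma = face_prob * sigma_dyn + (1.0 - face_prob) * sigma_sta
    rgb = face_prob[:, None] * rgb_dyn + (1.0 - face_prob)[:, None] * rgb_sta

    # Standard volume rendering on the composited density/color field.
    alpha = 1.0 - np.exp(-sigma * deltas)
    transmittance = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1] + 1e-10]))
    weights = alpha * transmittance
    return (weights[:, None] * rgb).sum(axis=0)  # final RGB for this ray
```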