We present Dynamic Neural Portraits, a novel approach to the problem of full-head reenactment. Our method generates photo-realistic video portraits by explicitly controlling head pose, facial expressions, and eye gaze. Our proposed architecture differs from existing methods that rely on GAN-based image-to-image translation networks to transform renderings of 3D faces into photo-realistic images. Instead, we build our system upon a 2D coordinate-based MLP with controllable dynamics. Our intuition for adopting a 2D-based representation, as opposed to recent 3D NeRF-like systems, stems from the fact that video portraits are captured by monocular stationary cameras; therefore, only a single viewpoint of the scene is available. Primarily, we condition our generative model on expression blendshapes; nonetheless, we show that our system can also be successfully driven by audio features. Our experiments demonstrate that the proposed method is 270 times faster than recent NeRF-based reenactment methods, with our networks achieving speeds of 24 fps for resolutions up to 1024 × 1024, while outperforming prior works in terms of visual quality.
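To make the core idea concrete, below is a minimal sketch of a 2D coordinate-based MLP with controllable dynamics: each pixel coordinate is positionally encoded and concatenated with per-frame control parameters (head pose, expression blendshape weights, eye gaze), and the network regresses the pixel's RGB value. This is our illustrative assumption of the general technique, not the authors' implementation; the class name `DynamicCoordMLP`, the layer sizes, and the control dimensions (3 pose + 50 blendshapes + 2 gaze) are all hypothetical.

```python
# Minimal sketch (an assumption, not the paper's released code) of a
# 2D coordinate-based MLP conditioned on per-frame controls.
import math
import torch
import torch.nn as nn

class DynamicCoordMLP(nn.Module):
    def __init__(self, n_freqs=10, cond_dim=3 + 50 + 2, hidden=256, layers=6):
        super().__init__()
        self.n_freqs = n_freqs  # positional-encoding frequencies for (x, y)
        in_dim = 4 * n_freqs + cond_dim  # sin/cos per frequency per coord
        blocks = [nn.Linear(in_dim, hidden), nn.ReLU()]
        for _ in range(layers - 2):
            blocks += [nn.Linear(hidden, hidden), nn.ReLU()]
        blocks += [nn.Linear(hidden, 3), nn.Sigmoid()]  # RGB in [0, 1]
        self.mlp = nn.Sequential(*blocks)

    def posenc(self, xy):
        # Standard sinusoidal positional encoding of 2D coordinates.
        freqs = 2.0 ** torch.arange(self.n_freqs, device=xy.device) * math.pi
        angles = xy[..., None] * freqs                      # (N, 2, n_freqs)
        enc = torch.cat([angles.sin(), angles.cos()], -1)   # (N, 2, 2*n_freqs)
        return enc.flatten(-2)                              # (N, 4*n_freqs)

    def forward(self, xy, cond):
        # xy: (N, 2) pixel coords in [-1, 1]; cond: (N, cond_dim) controls
        # (head pose + blendshape weights + gaze), broadcast per pixel.
        return self.mlp(torch.cat([self.posenc(xy), cond], dim=-1))

# Usage: render one frame for a single set of control parameters.
H = W = 256
ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                        torch.linspace(-1, 1, W), indexing="ij")
xy = torch.stack([xs, ys], dim=-1).reshape(-1, 2)
cond = torch.zeros(1, 55).expand(xy.shape[0], -1)  # pose(3)+blendshapes(50)+gaze(2)
frame = DynamicCoordMLP()(xy, cond).reshape(H, W, 3)
```

Because the scene is observed from a single stationary viewpoint, such a 2D formulation avoids volumetric ray marching entirely; one MLP evaluation per pixel is what plausibly accounts for the large speed advantage over NeRF-style renderers.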