Synthesizing realistic videos of humans using neural networks has become a popular alternative to the conventional graphics-based rendering pipeline due to its high efficiency. Existing works typically formulate this as an image-to-image translation problem in 2D screen space, which leads to artifacts such as over-smoothing, missing body parts, and temporal instability of fine-scale details, such as pose-dependent wrinkles in clothing. In this paper, we propose a novel human video synthesis method that addresses these limiting factors by explicitly disentangling the learning of time-coherent fine-scale details from the embedding of the human in 2D screen space. More specifically, our method relies on the combination of two convolutional neural networks (CNNs). Given the pose information, the first CNN predicts a dynamic texture map that contains time-coherent high-frequency details, and the second CNN generates the final video conditioned on the temporally coherent output of the first CNN. We demonstrate several applications of our approach, such as human reenactment and novel view synthesis from monocular video, where we show significant improvement over the state of the art both qualitatively and quantitatively.
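To make the two-stage design concrete, the following is a minimal PyTorch-style sketch of such a pipeline: a first network predicts a texture-space map from a pose input, the texture is warped into screen space via a UV lookup, and a second network translates the warped result into the final frame. All module names, tensor shapes, and layer choices here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a two-CNN texture-then-translation pipeline.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TexNet(nn.Module):
    """Stage 1 (assumed): predicts a dynamic texture map from a pose map."""
    def __init__(self, pose_ch=3, tex_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(pose_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, tex_ch, 3, padding=1),
        )

    def forward(self, pose_map):
        # Output lives in texture space, where fine detail stays time-coherent.
        return self.net(pose_map)

class RefineNet(nn.Module):
    """Stage 2 (assumed): translates the screen-space rendering into the final frame."""
    def __init__(self, in_ch=3, out_ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_ch, 3, padding=1),
        )

    def forward(self, rendered):
        return self.net(rendered)

def synthesize_frame(pose_map, uv_map, tex_net, refine_net):
    """Predict a texture from pose, warp it to screen space with a UV map
    (values in [-1, 1], shape (B, H, W, 2)), then refine into the output frame."""
    texture = tex_net(pose_map)                               # (B, C, Ht, Wt)
    rendered = F.grid_sample(texture, uv_map, align_corners=False)
    return refine_net(rendered)                               # (B, C, H, W)

# Example usage with random inputs (shapes are assumptions):
tex_net, refine_net = TexNet(), RefineNet()
pose_map = torch.randn(1, 3, 256, 256)
uv_map = torch.rand(1, 256, 256, 2) * 2 - 1
frame = synthesize_frame(pose_map, uv_map, tex_net, refine_net)
```

The intent of this split is that the first network only has to learn appearance in a pose-independent texture parameterization, while the second network handles the screen-space embedding, which is the disentanglement the abstract describes.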