Pose transfer of human videos aims to generate a high-fidelity video of a target person imitating the actions of a source person. A few studies have made great progress, either through image translation with deep latent features or through neural rendering with explicit 3D features. However, both rely on large amounts of training data to generate realistic results, and their performance degrades on more accessible internet videos due to insufficient training frames. In this paper, we demonstrate that dynamic details can be preserved even when training on short monocular videos. Overall, we propose a neural video rendering framework coupled with an image-translation-based dynamic details generation network (D2G-Net), which fully exploits both the stability of explicit 3D features and the capacity of learning components. Specifically, a novel texture representation is presented to encode both the static and the pose-varying appearance characteristics, which is then mapped to the image space and rendered as a detail-rich frame in the neural rendering stage. Moreover, we introduce a concise temporal loss in the training stage to suppress the detail flickering that becomes more visible because of the high-quality dynamic details our method generates. Through extensive comparisons, we demonstrate that our neural human video renderer achieves both clearer dynamic details and more robust performance, even on accessible short videos with only 2k-4k frames.
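The abstract does not spell out the form of the temporal loss. As a rough, hedged illustration of one common choice for such a term, the sketch below penalizes the L1 discrepancy between the current generated frame and the previous generated frame warped to the current time step with optical flow. This is a generic flow-based consistency loss written in PyTorch, not the paper's exact formulation; the helper `warp_with_flow` and the use of an L1 penalty are illustrative assumptions.

```python
# Minimal sketch of a flow-based temporal consistency loss (illustrative only,
# not the exact loss proposed in the paper).
import torch
import torch.nn.functional as F


def warp_with_flow(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Warp a (N, C, H, W) frame with a dense (N, 2, H, W) backward flow."""
    n, _, h, w = frame.shape
    # Build a normalized sampling grid in [-1, 1] for grid_sample.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=frame.device, dtype=frame.dtype),
        torch.arange(w, device=frame.device, dtype=frame.dtype),
        indexing="ij",
    )
    grid_x = (xs + flow[:, 0]) / (w - 1) * 2 - 1
    grid_y = (ys + flow[:, 1]) / (h - 1) * 2 - 1
    grid = torch.stack((grid_x, grid_y), dim=-1)  # (N, H, W, 2), x then y
    return F.grid_sample(frame, grid, align_corners=True)


def temporal_loss(curr: torch.Tensor, prev: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """L1 discrepancy between the current frame and the flow-warped previous frame."""
    return F.l1_loss(curr, warp_with_flow(prev, flow))
```

In practice such a term would be added to the image-space reconstruction and adversarial losses with a small weight, so that it damps frame-to-frame flicker without blurring the generated dynamic details.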