Capturing and faithfully rendering photo-realistic humans from novel views is a fundamental problem for AR/VR applications. While prior work has shown impressive performance-capture results in laboratory settings, achieving casual free-viewpoint human capture and rendering for unseen identities with high fidelity is non-trivial, especially for facial expressions, hands, and clothing. To tackle these challenges, we introduce a novel view synthesis framework that generates realistic renders from unseen views of any human captured by a single-view, sparse RGB-D sensor, similar to a low-cost depth camera, and without actor-specific models. We propose an architecture that creates dense feature maps in novel views via sphere-based neural rendering and produces complete renders using a global context inpainting model. Additionally, an enhancer network improves the overall fidelity, even in areas occluded from the original view, producing crisp renders with fine details. We show that our method generates high-quality novel views of synthetic and real human actors given a single-stream, sparse RGB-D input. It generalizes to unseen identities and new poses, and faithfully reconstructs facial expressions. Our approach outperforms prior view-synthesis methods and is robust to different levels of depth sparsity.
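To make the described pipeline concrete, below is a minimal PyTorch sketch of the three-stage structure named in the abstract: per-point features splatted into a novel-view feature map (a simple z-buffered stand-in for sphere-based neural rendering), a global context inpainting network that fills holes, and an enhancer that refines the result. All module names, layer sizes, and the toy camera model are illustrative assumptions, not the paper's actual architecture.

```python
# Illustrative sketch only; layer widths, camera model, and module names are assumptions.
import torch
import torch.nn as nn


class SphereFeatureSplatter(nn.Module):
    """Projects per-point features into a novel view as a dense (but hole-filled) feature map.

    Stand-in for sphere-based neural rendering: each 3D point from the RGB-D input is
    splatted to its projected pixel with z-buffering (nearest point wins).
    """

    def __init__(self, feat_dim: int, height: int, width: int):
        super().__init__()
        self.feat_dim, self.h, self.w = feat_dim, height, width

    def forward(self, points_cam, feats):
        # points_cam: (N, 3) points already in the novel camera frame (toy pinhole projection)
        # feats: (N, C) per-point features
        z = points_cam[:, 2].clamp(min=1e-6)
        px = ((points_cam[:, 0] / z * 0.5 + 0.5) * (self.w - 1)).round().long().clamp(0, self.w - 1)
        py = ((points_cam[:, 1] / z * 0.5 + 0.5) * (self.h - 1)).round().long().clamp(0, self.h - 1)
        idx = py * self.w + px

        # Approximate z-buffer: write far points first so near points overwrite them
        # (last-write-wins on duplicate indices; adequate for a sketch).
        order = torch.argsort(z, descending=True)
        feat_map = torch.zeros(self.feat_dim, self.h * self.w)
        zbuf = torch.full((self.h * self.w,), float("inf"))
        feat_map[:, idx[order]] = feats[order].t()
        zbuf[idx[order]] = z[order]
        mask = torch.isfinite(zbuf).float().view(1, self.h, self.w)
        return feat_map.view(self.feat_dim, self.h, self.w), mask


class GlobalContextInpainter(nn.Module):
    """Tiny encoder-decoder that turns the sparse feature map into a complete RGB render."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim + 1, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, feat_map, mask):
        # Concatenate the validity mask so the network knows which pixels are holes.
        return self.net(torch.cat([feat_map, mask], dim=0).unsqueeze(0))


class Enhancer(nn.Module):
    """Residual refinement network that sharpens the inpainted render."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, coarse_rgb):
        return coarse_rgb + self.net(coarse_rgb)


if __name__ == "__main__":
    # Toy example: 5k points with 16-dim features rendered into a 128x128 novel view.
    splatter = SphereFeatureSplatter(feat_dim=16, height=128, width=128)
    inpainter = GlobalContextInpainter(feat_dim=16)
    enhancer = Enhancer()

    pts = torch.rand(5000, 3) * torch.tensor([1.0, 1.0, 2.0]) + torch.tensor([-0.5, -0.5, 1.0])
    feats = torch.rand(5000, 16)
    feat_map, mask = splatter(pts, feats)
    coarse = inpainter(feat_map, mask)
    final = enhancer(coarse)
    print(final.shape)  # torch.Size([1, 3, 128, 128])
```

In the sketch, sparsity of the depth input simply shows up as more holes in the mask, which the inpainting stage must fill; this mirrors the abstract's claim of robustness to different levels of depth sparsity, though the actual networks used in the paper are more elaborate.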