Novel view synthesis for humans in motion is a challenging computer vision problem that enables applications such as free-viewpoint video. Existing methods typically require complex setups with multiple input views, 3D supervision, or pre-trained models that do not generalize well to new identities. Aiming to address these limitations, we present a novel view synthesis framework that generates realistic renders from unseen views of any human captured by a single-view sensor with sparse RGB-D, similar to a low-cost depth camera, and without actor-specific models. We propose an architecture that learns dense features in novel views obtained by sphere-based neural rendering and creates complete renders using a global context inpainting model. Additionally, an enhancer network improves the overall fidelity, even in areas occluded in the original view, producing crisp renders with fine details. We show that our method generates high-quality novel views of synthetic and real human actors given a single sparse RGB-D input. It generalizes to unseen identities and new poses, and faithfully reconstructs facial expressions. Our approach outperforms prior human view synthesis methods and is robust to different levels of input sparsity.
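The abstract describes a coarse-to-fine pipeline with three stages: sphere-based feature rendering into the novel view, global context inpainting of unobserved regions, and an enhancer that refines the result. Below is a minimal, hypothetical PyTorch-style sketch of that flow; the module names (FeatureProjector, GlobalContextInpainter, Enhancer), layer choices, and tensor shapes are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the three-stage pipeline; stand-in layers only.
import torch
import torch.nn as nn

class FeatureProjector(nn.Module):
    """Stand-in for sphere-based neural rendering: in the real method,
    per-point features from the input RGB-D view would be attached to
    spheres and rasterized into the target view; here we only encode."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.encode = nn.Conv2d(4, feat_dim, 3, padding=1)  # RGB-D -> features

    def forward(self, rgbd_src):
        return self.encode(rgbd_src)

class GlobalContextInpainter(nn.Module):
    """Stand-in for the global context inpainting model that completes
    regions unobserved in the source view."""
    def __init__(self, feat_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, feats):
        return self.net(feats)

class Enhancer(nn.Module):
    """Stand-in for the enhancer network that sharpens the coarse render."""
    def __init__(self):
        super().__init__()
        self.refine = nn.Conv2d(3, 3, 3, padding=1)

    def forward(self, coarse_rgb):
        return coarse_rgb + self.refine(coarse_rgb)  # residual refinement

def synthesize_novel_view(rgbd_src):
    """Coarse-to-fine forward pass: project -> inpaint -> enhance."""
    feats = FeatureProjector()(rgbd_src)
    coarse = GlobalContextInpainter()(feats)
    return Enhancer()(coarse)

# Usage on a dummy single-view RGB-D frame (batch, 4 channels, H, W).
novel_view = synthesize_novel_view(torch.rand(1, 4, 256, 256))
```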