We introduce a new method that generates photo-realistic humans under novel views and poses given a monocular video as input. Despite significant recent progress on this topic, with several methods exploring shared canonical neural radiance fields in dynamic scenes, learning a user-controlled model for unseen poses remains challenging. To tackle this problem, we introduce an effective method that (a) integrates observations across multiple frames and (b) encodes the appearance of each individual frame. We accomplish this by taking as input both the human pose, which models the body shape, and point clouds that partially cover the human. Our approach simultaneously learns a set of latent codes anchored to the human pose and shared across frames, and an appearance-dependent code anchored to the incomplete point cloud generated from each frame and its predicted depth. The former, pose-based code models the shape of the performer, whereas the latter, point-cloud-based code predicts fine-level details and reasons about missing structures at unseen poses. To further recover non-visible regions in query frames, we employ a temporal transformer that integrates the features of points in the query frame with those of tracked body points from automatically selected key frames. Experiments on dynamic-human sequences from several datasets, including ZJU-MoCap, show that our method significantly outperforms existing approaches under unseen poses and novel views given monocular video as input.
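To make the temporal aggregation step concrete, below is a minimal sketch of how per-point features from a query frame could be fused with features of the same body points tracked in key frames using a transformer encoder. This is an illustrative assumption, not the authors' implementation: the module name `TemporalFeatureFusion`, the feature dimension, the number of key frames, and the use of `nn.TransformerEncoder` are all hypothetical choices.

```python
# Hypothetical sketch (not the authors' code): fusing query-frame point features
# with tracked key-frame point features via a temporal transformer.
import torch
import torch.nn as nn

class TemporalFeatureFusion(nn.Module):
    def __init__(self, feat_dim=64, n_heads=4, n_layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

    def forward(self, query_feat, keyframe_feats):
        # query_feat:     (B, N, F)    features of N sampled points in the query frame
        # keyframe_feats: (B, K, N, F) features of the same body points tracked
        #                              in K automatically selected key frames
        B, K, N, F = keyframe_feats.shape
        tokens = torch.cat([query_feat.unsqueeze(1), keyframe_feats], dim=1)  # (B, 1+K, N, F)
        tokens = tokens.permute(0, 2, 1, 3).reshape(B * N, 1 + K, F)
        fused = self.transformer(tokens)      # attend across time, independently per point
        return fused[:, 0].reshape(B, N, F)   # fused feature for the query frame

# Usage with illustrative shapes: 2 videos, 1024 points, 3 key frames, 64-dim features
fusion = TemporalFeatureFusion()
q = torch.randn(2, 1024, 64)
k = torch.randn(2, 3, 1024, 64)
out = fusion(q, k)  # (2, 1024, 64)
```

Treating each body point's query and key-frame features as a short token sequence lets attention weight visible key-frame observations more heavily when the point is occluded in the query frame, which is one plausible way to realize the aggregation described above.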