Human reconstruction and synthesis from monocular RGB videos is a challenging problem due to clothing, occlusion, texture discontinuities and sharpness, and frame-specific pose changes. Many methods employ deferred rendering, NeRFs, and implicit methods to represent clothed humans, on the premise that mesh-based representations cannot capture complex clothing and textures from RGB images, silhouettes, and keypoints alone. We provide a counterpoint to this fundamental premise by optimizing a SMPL+D mesh and an efficient, multi-resolution texture representation using only RGB images, binary silhouettes, and sparse 2D keypoints. Experimental results demonstrate that our approach captures geometric detail better than visual-hull-based and mesh-based methods. We show competitive novel view synthesis and improved novel pose synthesis compared to NeRF-based methods, which introduce noticeable, unwanted artifacts. By restricting the solution space to the SMPL+D model combined with differentiable rendering, we obtain dramatic speedups in compute, training times (up to 24x), and inference times (up to 192x). Our method can therefore be used as-is or as a fast initialization for NeRF-based methods.
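The optimization described above can be sketched as a standard differentiable rendering loop over per-frame supervision. The snippet below is a minimal illustration under stated assumptions, not the authors' implementation: `smpl`, `renderer`, `project`, and the per-frame `loader` are hypothetical stand-ins for a SMPL body model, a differentiable rasterizer, a camera projection, and the training data, and the loss weights are arbitrary.

```python
import torch
import torch.nn.functional as F

def upsample_and_sum(pyramid, size=256):
    # Composite a multi-resolution texture: bilinearly upsample each
    # level to the finest resolution and sum them.
    return sum(F.interpolate(t[None], size=size, mode="bilinear",
                             align_corners=False)[0] for t in pyramid)

# Learnable parameters: per-vertex displacements D on top of SMPL (the "+D"),
# and a coarse-to-fine pyramid of RGB texture maps.
num_verts = 6890  # SMPL template vertex count
D = torch.zeros(num_verts, 3, requires_grad=True)
texture_pyramid = [torch.rand(3, r, r, requires_grad=True) for r in (64, 128, 256)]

optim = torch.optim.Adam([D, *texture_pyramid], lr=1e-3)

for rgb_gt, mask_gt, kpts_gt, pose, shape, cam in loader:  # per-frame data
    verts, joints = smpl(pose, shape)           # hypothetical SMPL forward pass
    verts = verts + D                           # SMPL+D: add learned offsets
    texture = upsample_and_sum(texture_pyramid)
    rgb, mask = renderer(verts, texture, cam)   # hypothetical differentiable render
    kpts2d = project(joints, cam)               # hypothetical 2D joint projection

    # Supervise with only RGB, binary silhouettes, and sparse 2D keypoints.
    loss = (rgb - rgb_gt).abs().mean() \
         + 0.1 * (mask - mask_gt.float()).abs().mean() \
         + 0.01 * (kpts2d - kpts_gt).pow(2).mean()

    optim.zero_grad()
    loss.backward()
    optim.step()
```

Under this reading, the reported speedups follow from the restricted solution space: gradients flow through a single rasterization of a fixed-topology mesh rather than per-ray volume integration as in NeRF-based methods.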