We propose DeepMultiCap, a novel method for multi-person performance capture from sparse multi-view cameras. Our method captures time-varying surface details without the need for pre-scanned template models. To tackle the severe occlusion challenge in closely interacting scenes, we combine the recently proposed pixel-aligned implicit function with a parametric body model for robust reconstruction of invisible surface areas. An effective attention-aware module is designed to obtain fine-grained geometry details from multi-view images, enabling high-fidelity results. In addition to this spatial attention method, for video inputs we further propose a novel temporal fusion method to alleviate noise and temporal inconsistencies in moving character reconstruction. For quantitative evaluation, we contribute a high-quality multi-person dataset, MultiHuman, which consists of 150 static scenes with different levels of occlusion and ground-truth 3D human models. Experimental results demonstrate the state-of-the-art performance of our method and its strong generalization to real multi-view video data, outperforming prior works by a large margin.
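To make the core idea concrete, below is a minimal, illustrative sketch (not the authors' released implementation) of how per-view pixel-aligned image features could be fused across views with self-attention, combined with features sampled from an encoding of the fitted parametric body model, and decoded into per-point occupancy. All module names, feature dimensions, and the averaging scheme (`AttentionFusion`, `ImplicitDecoder`, `feat_dim`, `body_dim`) are assumptions made for this sketch.

```python
# Sketch only: attention-based multi-view fusion of pixel-aligned features,
# concatenated with parametric-body-model features, decoded to occupancy.
# Module structure and dimensions are assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class AttentionFusion(nn.Module):
    """Fuse per-view pixel-aligned features with multi-head self-attention."""
    def __init__(self, feat_dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, view_feats):          # (B*n_points, n_views, feat_dim)
        fused, _ = self.attn(view_feats, view_feats, view_feats)
        return fused.mean(dim=1)            # average the attended views


class ImplicitDecoder(nn.Module):
    """MLP mapping fused image features + body-model features to occupancy."""
    def __init__(self, feat_dim=256, body_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + body_dim + 3, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Sigmoid(),
        )

    def forward(self, fused_feat, body_feat, points):
        return self.mlp(torch.cat([fused_feat, body_feat, points], dim=-1))


def query_occupancy(pixel_feats, body_feats, points, fusion, decoder):
    """
    pixel_feats: (B, n_views, n_points, feat_dim) image features sampled at each
                 query point's 2D projection in every view (pixel-aligned).
    body_feats:  (B, n_points, body_dim) features sampled from an encoding of the
                 fitted parametric model, acting as a prior for occluded regions.
    points:      (B, n_points, 3) query locations.
    """
    B, V, P, C = pixel_feats.shape
    flat = pixel_feats.permute(0, 2, 1, 3).reshape(B * P, V, C)
    fused = fusion(flat).reshape(B, P, C)
    return decoder(fused, body_feats, points)   # (B, n_points, 1) occupancy


if __name__ == "__main__":
    fusion, decoder = AttentionFusion(), ImplicitDecoder()
    occ = query_occupancy(torch.randn(2, 3, 1024, 256),
                          torch.randn(2, 1024, 128),
                          torch.randn(2, 1024, 3),
                          fusion, decoder)
    print(occ.shape)  # torch.Size([2, 1024, 1])
```

The sketch only illustrates the per-frame spatial fusion; the temporal fusion for video inputs described above would operate on top of such per-frame predictions.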