We present a new end-to-end learning framework to obtain detailed and spatially coherent reconstructions of multiple people from a single image. Existing multi-person methods suffer from two main drawbacks: either they are model-based and therefore cannot capture accurate 3D models of people with loose clothing and hair, or they require manual intervention to resolve occlusions or interactions. Our method addresses both limitations by introducing the first end-to-end learning approach to perform model-free implicit reconstruction for realistic 3D capture of multiple clothed people in arbitrary poses (with occlusions) from a single image. Our network simultaneously estimates the 3D geometry of each person and their 6DOF spatial locations to obtain a coherent multi-human reconstruction. In addition, we introduce a new synthetic dataset depicting images with a varying number of inter-occluded humans and a variety of clothing and hair styles. We demonstrate robust, high-resolution reconstructions on images of multiple humans with complex occlusions, loose clothing, and a large variety of poses and scenes. Our quantitative evaluation on both synthetic and real-world datasets demonstrates state-of-the-art performance with significant improvements in the accuracy and completeness of the reconstructions over competing approaches.