Image-based multi-person reconstruction in wide-field large scenes is critical for crowd analysis and security alerting. However, existing methods cannot handle large scenes containing hundreds of people, which pose the challenges of a large number of people, large variations in human scale, and complex spatial distributions. In this paper, we propose Crowd3D, the first framework to reconstruct the 3D poses, shapes, and locations of hundreds of people with global consistency from a single large-scene image. The core of our approach is to convert the problem of complex crowd localization into pixel localization with the help of our newly defined concept, the Human-scene Virtual Interaction Point (HVIP). To reconstruct the crowd with global consistency, we propose a progressive reconstruction network based on HVIP that pre-estimates a scene-level camera and a ground plane. To handle the large number of persons and the varying human sizes, we also design an adaptive human-centric cropping scheme. In addition, we contribute LargeCrowd, a benchmark dataset for crowd reconstruction in large scenes. Experimental results demonstrate the effectiveness of the proposed method. The code and datasets will be made public.
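To give intuition for how a pre-estimated scene-level camera and ground plane let a 2D pixel location anchor a person in globally consistent 3D, the sketch below back-projects a pixel through a pinhole camera and intersects the ray with the ground plane. This is only an illustrative geometric sketch under assumed names (`pixel_to_ground`, a camera-frame plane `n·X = d`), not the paper's actual HVIP formulation or network.

```python
import numpy as np

def pixel_to_ground(u, v, K, n, d):
    """Back-project pixel (u, v) through intrinsics K and intersect the
    resulting ray with the ground plane n·X = d (camera coordinates).
    Returns the 3D intersection point on the plane."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])  # ray direction in camera frame
    t = d / (n @ ray)                               # scale t so that n·(t*ray) = d
    return t * ray

# Hypothetical pinhole camera; ground plane z = 5 in the camera frame.
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
n = np.array([0.0, 0.0, 1.0])
point = pixel_to_ground(320.0, 240.0, K, n, 5.0)
print(point)  # the principal-point ray meets the plane at (0, 0, 5)
```

Pinning each person to such a scene-anchored 3D point is what keeps hundreds of per-crop reconstructions mutually consistent in one global coordinate system.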