Advances in the state of the art for 3D human sensing are currently limited by the lack of visual datasets with 3D ground truth, including multiple people, in motion, operating in real-world environments, with complex illumination or occlusion, and potentially observed by a moving camera. Sophisticated scene understanding would require estimating human pose and shape as well as gestures, towards representations that ultimately combine useful metric and behavioral signals with free-viewpoint photo-realistic visualisation capabilities. To sustain progress, we build a large-scale photo-realistic dataset, Human-SPACE (HSPACE), of animated humans placed in complex synthetic indoor and outdoor environments. We combine a hundred diverse individuals of varying ages, gender, proportions, and ethnicity, with hundreds of motions and scenes, as well as parametric variations in body shape (for a total of 1,600 different humans), in order to generate an initial dataset of over 1 million frames. Human animations are obtained by fitting an expressive human body model, GHUM, to single scans of people, followed by novel re-targeting and positioning procedures that support the realistic animation of dressed humans, statistical variation of body proportions, and jointly consistent scene placement of multiple moving people. Assets are generated automatically, at scale, and are compatible with existing real-time rendering and game engines. The dataset, together with an evaluation server, will be made available for research. Our large-scale analysis of the impact of synthetic data, in connection with real data and weak supervision, underlines the considerable potential for continuing quality improvements and limiting the sim-to-real gap, in this practical setting, in connection with increased model capacity.