Image- and video-based 3D human recovery (i.e., pose and shape estimation) has achieved substantial progress. However, due to the prohibitive cost of motion capture, existing datasets are often limited in scale and diversity, which hinders the development of more powerful models. In this work, we obtain massive human sequences and their 3D ground truths by playing video games. Specifically, we contribute GTA-Human, a mega-scale and highly diverse 3D human dataset generated with the GTA-V game engine. With a rich set of subjects, actions, and scenarios, GTA-Human serves as an effective training source. Notably, the "unreasonable effectiveness of data" phenomenon is validated in 3D human recovery using our game-playing data. A simple frame-based baseline trained on GTA-Human already outperforms more sophisticated methods by a large margin; for video-based methods, GTA-Human demonstrates superiority over even the in-domain training set. We extend our study to larger models and observe the same consistent improvements, and our study on supervision signals suggests that the rich collection of SMPL annotations is key. Furthermore, equipped with the diverse annotations in GTA-Human, we systematically investigate the performance of various methods under a wide spectrum of real-world variations, e.g., camera angles, poses, and occlusions. We hope our work can pave the way for scaling up 3D human recovery to the real world.