PeopleSansPeople: A Synthetic Data Generator for Human-Centric Computer Vision

Source: arXiv. The PeopleSansPeople template Unity environment, benchmark binaries, and source code are available at: https://github.com/Unity-Technologies/PeopleSansPeople

In recent years, person detection and human pose estimation have made great strides, helped by large-scale labeled datasets. However, these datasets had no guarantees or analysis of human activities, poses, or context diversity. Additionally, privacy, legal, safety, and ethical concerns may limit the ability to collect more human data. An emerging alternative to real-world data that alleviates some of these issues is synthetic data. However, creation of synthetic data generators is incredibly challenging and prevents researchers from exploring their usefulness. Therefore, we release a human-centric synthetic data generator PeopleSansPeople which contains simulation-ready 3D human assets, a parameterized lighting and camera system, and generates 2D and 3D bounding box, instance and semantic segmentation, and COCO pose labels. Using PeopleSansPeople, we performed benchmark synthetic data training using a Detectron2 Keypoint R-CNN variant [1]. We found that pre-training a network using synthetic data and fine-tuning on various sizes of real-world data resulted in a keypoint AP increase of $+38.03$ ($44.43 \pm 0.17$ vs. $6.40$) for few-shot transfer (limited subsets of COCO-person train [2]), and an increase of $+1.47$ ($63.47 \pm 0.19$ vs. $62.00$) for abundant real data regimes, outperforming models trained with the same real data alone. We also found that our models outperformed those pre-trained with ImageNet with a keypoint AP increase of $+22.53$ ($44.43 \pm 0.17$ vs. $21.90$) for few-shot transfer and $+1.07$ ($63.47 \pm 0.19$ vs. $62.40$) for abundant real data regimes. This freely-available data generator should enable a wide range of research into the emerging field of simulation to real transfer learning in the critical area of human-centric computer vision.
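The keypoint AP gains quoted above are simple differences between the reported mean APs, and a minimal sketch can make that arithmetic explicit. The AP values below are copied from the abstract (the ± spread is omitted); the dictionary keys and the `ap_gains` helper are hypothetical names chosen for illustration and are not part of the PeopleSansPeople code.

```python
# Mean keypoint AP values copied from the abstract (the +/- spread is omitted).
# Key names are hypothetical labels, not identifiers from the paper or its code.
REPORTED_AP = {
    "few_shot": {
        "synthetic_pretrain": 44.43,
        "real_data_only": 6.40,
        "imagenet_pretrain": 21.90,
    },
    "abundant": {
        "synthetic_pretrain": 63.47,
        "real_data_only": 62.00,
        "imagenet_pretrain": 62.40,
    },
}


def ap_gains(regime):
    """Return the AP gain of synthetic pre-training over each baseline."""
    row = REPORTED_AP[regime]
    ours = row["synthetic_pretrain"]
    return {
        name: round(ours - value, 2)
        for name, value in row.items()
        if name != "synthetic_pretrain"
    }


if __name__ == "__main__":
    for regime in REPORTED_AP:
        print(regime, ap_gains(regime))
```

Laid out this way, the pattern in the abstract is easy to see: the gain from synthetic pre-training is large when real data is scarce and shrinks to roughly one AP point when real data is abundant.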
