We consider the problem of recovering a single person's 3D human mesh from in-the-wild crowded scenes. While much progress has been made in 3D human mesh estimation, existing methods struggle when the test input contains crowded scenes. The first reason for the failure is a domain gap between training and testing data. Motion capture datasets, which provide accurate 3D labels for training, lack crowd data, which impedes a network from learning image features of a target person that are robust to crowded scenes. The second reason is feature processing that spatially averages the feature map of a localized bounding box containing multiple people. Averaging the whole feature map makes a target person's features indistinguishable from those of others. We present 3DCrowdNet, which is the first to explicitly target in-the-wild crowded scenes and estimates a robust 3D human mesh by addressing the above issues. First, we leverage 2D human pose estimation, which does not require a motion capture dataset with 3D labels for training and thus does not suffer from the domain gap. Second, we propose a joint-based regressor that distinguishes a target person's features from those of others. Our joint-based regressor preserves the spatial activation of a target by sampling features at the target's joint locations and regresses human model parameters from them. As a result, 3DCrowdNet learns target-focused features and effectively excludes the irrelevant features of nearby persons. We conduct experiments on various benchmarks and demonstrate the robustness of 3DCrowdNet to in-the-wild crowded scenes both quantitatively and qualitatively. The code is available at https://github.com/hongsukchoi/3DCrowdNet_RELEASE.
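To make the joint-based sampling idea concrete, the following is a minimal sketch (not the paper's implementation) of extracting per-joint feature vectors from a convolutional feature map at a target person's 2D joint locations, so that only the target's activations, not those of nearby people, feed the regressor. Function names, shapes, and the nearest-neighbor lookup are illustrative assumptions; the actual method uses learned 2D pose guidance and differentiable (bilinear) sampling.

```python
import numpy as np

def sample_joint_features(feature_map, joints):
    """Sample a feature vector per joint from a (C, H, W) feature map.

    feature_map: (C, H, W) array from an image backbone (hypothetical shape).
    joints: (J, 2) array of (x, y) pixel coordinates of the target's 2D joints.
    Returns a (J, C) array of target-focused features.
    """
    C, H, W = feature_map.shape
    feats = []
    for x, y in joints:
        # Nearest-neighbor lookup for simplicity; a real model would use
        # bilinear interpolation so gradients flow through the coordinates.
        xi = int(np.clip(round(x), 0, W - 1))
        yi = int(np.clip(round(y), 0, H - 1))
        feats.append(feature_map[:, yi, xi])
    return np.stack(feats)

# Toy example: 2 joints on a 4-channel, 8x8 feature map.
fm = np.arange(4 * 8 * 8, dtype=np.float32).reshape(4, 8, 8)
joint_feats = sample_joint_features(fm, np.array([[1.0, 2.0], [6.0, 5.0]]))
print(joint_feats.shape)  # (2, 4)
```

Because features are gathered only at the target's joint coordinates, activations belonging to other people in the same bounding box are simply never read, in contrast to global average pooling over the whole crop.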