This paper focuses on the regression of multiple 3D people from a single RGB image. Existing approaches predominantly follow a multi-stage pipeline, which first detects people with bounding boxes and then regresses their 3D body meshes. In contrast, we propose to Regress all meshes in a One-stage fashion for Multiple 3D People (termed ROMP), which is conceptually simple, bounding-box-free, and able to learn per-pixel representations in an end-to-end manner. Our method simultaneously predicts a Body Center heatmap and a Mesh Parameter map, which jointly describe the 3D body mesh at the pixel level. Through a body-center-guided sampling process, the body mesh parameters of all people in the image can be easily extracted from the Mesh Parameter map. Equipped with such a fine-grained representation, our one-stage framework is free of the complex multi-stage process and more robust to occlusion. Compared with state-of-the-art methods, ROMP achieves superior performance on challenging multi-person/occlusion benchmarks, including 3DPW, CMU Panoptic, and 3DOH50K. Experiments on crowded/occluded datasets demonstrate its robustness under various types of occlusion. It is also worth noting that our released demo code ( https://github.com/Arthur151/ROMP ) is the first real-time (over 30 FPS) implementation of monocular multi-person 3D mesh regression to date.
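The body-center-guided sampling described above can be sketched in a few lines: find local maxima in the center heatmap, then read out the per-pixel parameter vectors at those locations. The following is a minimal NumPy illustration, not the paper's implementation; the function name, the 3x3 peak window, and the confidence threshold are all illustrative assumptions.

```python
import numpy as np

def extract_mesh_params(center_heatmap, param_map, threshold=0.3):
    """Sample body mesh parameters at detected body centers.

    center_heatmap: (H, W) confidence that a pixel is a body center.
    param_map:      (H, W, C) per-pixel mesh parameter vectors
                    (e.g. pose/shape/camera parameters).
    Returns an (N, C) array: one parameter vector per detected person.
    Note: a toy sketch of center-guided sampling, not ROMP's actual code.
    """
    H, W = center_heatmap.shape
    # Simple 3x3 local-maximum test (a stand-in for max-pool NMS):
    # a pixel is a peak if it equals the max of its 3x3 neighborhood.
    padded = np.pad(center_heatmap, 1, mode="constant", constant_values=-np.inf)
    shifted = np.stack([padded[dy:dy + H, dx:dx + W]
                        for dy in range(3) for dx in range(3)], axis=0)
    is_peak = center_heatmap >= shifted.max(axis=0)
    ys, xs = np.where(is_peak & (center_heatmap > threshold))
    # Index the parameter map at the detected centers.
    return param_map[ys, xs]  # shape (N, C)
```

Because detection reduces to this per-pixel readout, no separate bounding-box stage or per-person crop is needed, which is what makes the one-stage formulation possible.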