This paper studies the task of estimating the 3D human poses of multiple persons from multiple calibrated camera views. Following the top-down paradigm, we decompose the task into two stages: person localization and pose estimation. Both stages are processed in a coarse-to-fine manner, and we propose three task-specific graph neural networks for effective message passing. For 3D person localization, we first use the Multi-view Matching Graph Module (MMG) to learn the cross-view association and recover coarse human proposals. The Center Refinement Graph Module (CRG) then refines these proposals via flexible point-based prediction. For 3D pose estimation, the Pose Regression Graph Module (PRG) learns both the multi-view geometry and the structural relations between human joints. Our approach achieves state-of-the-art performance on the CMU Panoptic and Shelf datasets with significantly lower computational complexity.