We present the Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images. Instead of estimating 3D joint locations from a costly volumetric representation or reconstructing per-person 3D poses from multiple detected 2D poses as in previous methods, MvP directly regresses multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks. Specifically, MvP represents skeleton joints as learnable query embeddings and lets them progressively attend to and reason over the multi-view information from the input images to directly regress the actual 3D joint locations. To improve the accuracy of such a simple pipeline, MvP presents a hierarchical scheme to concisely represent the query embeddings of multi-person skeleton joints and introduces an input-dependent query adaptation approach. Further, MvP designs a novel geometrically guided attention mechanism, called projective attention, to more precisely fuse the cross-view information for each joint. MvP also introduces a RayConv operation to integrate view-dependent camera geometry into the feature representations, augmenting the projective attention. We show experimentally that our MvP model outperforms state-of-the-art methods on several benchmarks while being much more efficient. Notably, it achieves 92.3% AP25 on the challenging Panoptic dataset, improving upon the previous best approach [36] by 9.8%. MvP is general and also extendable to recovering the human mesh represented by the SMPL model, and is thus useful for modeling multi-person body shapes. Code and models are available at https://github.com/sail-sg/mvp.
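To make the two geometric components named above concrete, below is a minimal, illustrative PyTorch sketch of projective attention and the RayConv operation. The module names, tensor shapes, and the simple softmax-based view fusion are assumptions made here for illustration, not the paper's exact formulation; the official implementation is at https://github.com/sail-sg/mvp.

```python
# Illustrative sketch only: shapes, names, and the fusion rule below are
# assumptions, not the official MvP implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RayConv(nn.Module):
    """Concatenates per-pixel camera ray directions to a view's feature map
    before a 1x1 convolution, injecting view-dependent camera geometry."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch + 3, out_ch, kernel_size=1)

    def forward(self, feats, rays):
        # feats: (B, C, H, W) image features
        # rays:  (B, 3, H, W) unit ray directions through each pixel
        return self.conv(torch.cat([feats, rays], dim=1))

class ProjectiveAttention(nn.Module):
    """For each 3D joint query, projects its current 3D estimate into every
    camera view, bilinearly samples features at the projected 2D points,
    and fuses them across views with learned attention weights."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # per-view attention logit
        self.proj = nn.Linear(dim, dim)

    def forward(self, query, joints_3d, view_feats, cameras):
        # query:      (B, J, D) joint query embeddings
        # joints_3d:  (B, J, 3) current 3D joint estimates
        # view_feats: list of V tensors, each (B, D, H, W)
        # cameras:    list of V projection functions (B, J, 3) -> (B, J, 2),
        #             returning coords normalized to [-1, 1] for grid_sample
        sampled = []
        for feats, project in zip(view_feats, cameras):
            uv = project(joints_3d)                 # (B, J, 2)
            grid = uv.unsqueeze(2)                  # (B, J, 1, 2)
            f = F.grid_sample(feats, grid, align_corners=False)  # (B, D, J, 1)
            sampled.append(f.squeeze(-1).permute(0, 2, 1))       # (B, J, D)
        stacked = torch.stack(sampled, dim=2)       # (B, J, V, D)
        logits = self.score(stacked + query.unsqueeze(2))        # (B, J, V, 1)
        weights = logits.softmax(dim=2)             # attention over views
        fused = (weights * stacked).sum(dim=2)      # (B, J, D)
        return query + self.proj(fused)
```

The point this sketch captures is that each joint query gathers features only at its projected 2D locations across views, rather than attending densely over all pixels, while RayConv makes the sampled features aware of each camera's viewing geometry.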