Recovering multi-person 3D poses from a single RGB image is a severely ill-conditioned problem due to the inherent 2D-3D depth ambiguity, inter-person occlusions, and body truncations. To tackle these issues, recent works have shown promising results by reasoning jointly about multiple people. In most cases, however, this is done by considering only pairwise person interactions, which hinders a holistic scene representation able to capture long-range interactions. Approaches that jointly process all people in the scene address this, but they require designating one individual as a reference and fixing a person ordering in advance, and they are sensitive to these choices. In this paper, we overcome both limitations and propose an approach for multi-person 3D pose estimation that captures long-range interactions independently of the input order. For this purpose, we build a residual-like permutation-invariant network that successfully refines potentially corrupted initial 3D poses estimated by an off-the-shelf detector. The residual function is learned via Set Transformer blocks, which model the interactions among all initial poses regardless of their ordering or number. A thorough evaluation demonstrates that our approach boosts the performance of the initially estimated 3D poses by large margins, achieving state-of-the-art results on standardized benchmarks. Additionally, the proposed module is computationally efficient and can potentially be used as a drop-in complement for any 3D pose detector in multi-person scenes.
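The order-independent residual refinement described above can be illustrated with a toy sketch. This is not the authors' network: it replaces the Set Transformer blocks with one plain single-head self-attention layer, uses random untrained weights, and every name in it (`refine`, `self_attention`, `random_matrix`, the 6-dimensional toy pose) is a hypothetical choice made for illustration. It only shows the structural idea, namely that each initial pose receives a correction computed from all poses in the scene, and that reordering the input people simply reorders the refined output.

```python
import math
import random


def linear(x, w):
    """Multiply row vector x (length d_in) by matrix w (d_in x d_out)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(poses, wq, wk, wv):
    """Single-head self-attention: every pose attends to every pose in the set."""
    q = [linear(p, wq) for p in poses]
    k = [linear(p, wk) for p in poses]
    v = [linear(p, wv) for p in poses]
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def refine(poses, wq, wk, wv):
    """Residual refinement: refined pose = initial pose + learned correction."""
    delta = self_attention(poses, wq, wk, wv)
    return [[p + d for p, d in zip(pose, corr)] for pose, corr in zip(poses, delta)]


def random_matrix(rng, rows, cols, scale=0.1):
    """Stand-in for trained weights (assumption: real weights would be learned)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim = 6  # toy pose: 2 joints x 3 coordinates, flattened
wq, wk, wv = (random_matrix(rng, dim, dim) for _ in range(3))

# Four people with "initial" poses, standing in for an off-the-shelf detector.
poses = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4)]

refined = refine(poses, wq, wk, wv)

# Feed the same people in a different order and refine again.
perm = [2, 0, 3, 1]
refined_perm = refine([poses[i] for i in perm], wq, wk, wv)

# Each person gets the same refined pose regardless of input order.
max_diff = max(abs(a - b)
               for j, i in enumerate(perm)
               for a, b in zip(refined_perm[j], refined[i]))
print("order-independent up to rounding:", max_diff < 1e-9)
```

Because attention sums over all people with weights that depend only on pose content, the same function handles any number of people and needs neither a reference person nor a fixed ordering.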