Estimating 3D poses and shapes in the form of meshes from monocular RGB images is challenging, and more so than estimating 3D poses alone in the form of skeletons or heatmaps. When interacting persons are involved, 3D mesh reconstruction becomes even more challenging because of the ambiguity introduced by person-to-person occlusions. To tackle these challenges, we propose a coarse-to-fine pipeline that benefits from 1) inverse kinematics applied to occlusion-robust 3D skeleton estimates and 2) Transformer-based relation-aware refinement. In our pipeline, we first obtain occlusion-robust 3D skeletons for multiple persons from an RGB image. Then, we apply inverse kinematics to convert the estimated skeletons into deformable 3D mesh parameters. Finally, we apply Transformer-based mesh refinement, which refines the obtained mesh parameters by considering intra- and inter-person relations of the 3D meshes. Through extensive experiments, we demonstrate the effectiveness of our method, which outperforms state-of-the-art methods on the 3DPW, MuPoTS and AGORA datasets.
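To make the inverse-kinematics step more concrete, the following is a minimal sketch of its simplest building block, not the method of this work: for a single bone, it finds the axis-angle rotation that turns the rest-pose bone direction into the direction given by an estimated skeleton. The function names `bone_axis_angle` and `rotate` are hypothetical, and the sketch assumes that joint rotations are parameterized as axis-angle vectors. It also ignores the twist around the bone axis, which two joint positions alone cannot determine.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _normalize(v):
    n = math.sqrt(_dot(v, v))
    if n == 0.0:
        raise ValueError("zero-length vector")
    return tuple(x / n for x in v)


def bone_axis_angle(rest_parent, rest_child, est_parent, est_child):
    """Axis-angle rotation (hypothetical helper) aligning a rest-pose bone
    with the bone direction observed in an estimated 3D skeleton."""
    a = _normalize(_sub(rest_child, rest_parent))  # rest-pose direction
    b = _normalize(_sub(est_child, est_parent))    # estimated direction
    c = _cross(a, b)
    s = math.sqrt(_dot(c, c))  # sine of the angle between a and b
    d = _dot(a, b)             # cosine of the angle between a and b
    if s < 1e-9:
        if d > 0.0:
            return (0.0, 0.0, 0.0)  # already aligned: no rotation needed
        # Opposite directions: rotate by pi about any axis perpendicular to a.
        helper = (1.0, 0.0, 0.0) if abs(a[0]) < 0.9 else (0.0, 1.0, 0.0)
        axis = _normalize(_cross(a, helper))
        return tuple(math.pi * x for x in axis)
    angle = math.atan2(s, d)
    return tuple(angle * x / s for x in c)


def rotate(v, axis_angle):
    """Apply an axis-angle rotation to v using Rodrigues' formula."""
    angle = math.sqrt(_dot(axis_angle, axis_angle))
    if angle < 1e-12:
        return tuple(v)
    k = tuple(x / angle for x in axis_angle)
    kxv = _cross(k, v)
    kdv = _dot(k, v)
    c, s = math.cos(angle), math.sin(angle)
    return tuple(v[i] * c + kxv[i] * s + k[i] * kdv * (1.0 - c)
                 for i in range(3))


# Usage: a bone pointing up in the rest pose, pointing along +x in the estimate.
aa = bone_axis_angle((0, 0, 0), (0, 1, 0), (1, 1, 1), (2, 1, 1))
aligned = rotate((0.0, 1.0, 0.0), aa)
```

In this toy example `aligned` coincides with the estimated bone direction up to floating-point error. A full inverse-kinematics solver would apply such per-bone rotations along the kinematic tree, expressing each one relative to its parent joint.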