This paper presents a simple yet powerful method for 3D human mesh reconstruction from a single RGB image. Recent transformer-based approaches effectively estimate non-local interactions among all mesh vertices, and graph models have begun to handle the relationships between body parts. Although these approaches have shown remarkable progress in 3D human mesh reconstruction, it remains difficult to directly infer the relationship between the features encoded from the 2D input image and the 3D coordinates of each vertex. To resolve this problem, we propose a simple feature sampling scheme. The key idea is to sample features in the embedded space guided by points that are estimated as the 2D projections of the 3D mesh vertices (i.e., the ground truth). This helps the model concentrate on vertex-relevant features in the 2D space, which leads to the reconstruction of natural human poses. Furthermore, we apply progressive attention masking to precisely estimate local interactions between vertices, even under severe occlusions. Experimental results on benchmark datasets show that the proposed method efficiently improves the performance of 3D human mesh reconstruction. The code and model are publicly available at: https://github.com/DCVL-3D/PointHMR_release.
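The point-guided sampling idea can be illustrated with a minimal sketch. It is not the released implementation: the abstract does not state how features are read out at the estimated points, so bilinear interpolation is assumed here, and the names `bilinear_sample` and `sample_vertex_features` are hypothetical. Points are assumed to be given in the pixel coordinates of the feature map.

```python
from math import floor


def bilinear_sample(feature_map, x, y):
    """Return the feature vector at continuous location (x, y).

    feature_map is a nested list indexed as feature_map[row][col][channel].
    Bilinear interpolation is an assumption made for this sketch.
    """
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])

    # Clamp so that points slightly outside the map still receive a feature.
    x = min(max(float(x), 0.0), width - 1.0)
    y = min(max(float(y), 0.0), height - 1.0)

    # Integer corners surrounding the point and the fractional offsets.
    x0, y0 = int(floor(x)), int(floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0

    sampled = []
    for c in range(channels):
        top = (1 - wx) * feature_map[y0][x0][c] + wx * feature_map[y0][x1][c]
        bottom = (1 - wx) * feature_map[y1][x0][c] + wx * feature_map[y1][x1][c]
        sampled.append((1 - wy) * top + wy * bottom)
    return sampled


def sample_vertex_features(feature_map, points_2d):
    """Gather one feature vector per estimated 2D point (one point per vertex)."""
    return [bilinear_sample(feature_map, x, y) for x, y in points_2d]


if __name__ == "__main__":
    # Toy 2x2 feature map with a single channel.
    toy_map = [[[0.0], [1.0]], [[2.0], [3.0]]]
    # Hypothetical estimated projections of three vertices.
    points = [(0.5, 0.5), (1.0, 0.0), (0.0, 1.0)]
    print(sample_vertex_features(toy_map, points))
```

Under these assumptions, each vertex receives a feature drawn from its own estimated image location rather than from a global pooled descriptor, which is the sense in which the sampling is vertex-relevant.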