One-Stage 3D Whole-Body Mesh Recovery with Component Aware Transformer

Whole-body mesh recovery aims to estimate 3D human body, face, and hand parameters from a single image. Performing this task with a single network is challenging because of resolution: the face and hands usually occupy extremely small regions of the image. Existing works usually detect the hands and face, enlarge each crop, feed it into a dedicated network to predict its parameters, and finally fuse the results. While this copy-paste pipeline can capture the fine-grained details of the face and hands, the connections between different parts cannot be easily recovered in late fusion, leading to implausible 3D rotations and unnatural poses. In this work, we propose a one-stage pipeline for expressive whole-body mesh recovery, named OSX, without separate networks for each part. Specifically, we design a Component Aware Transformer (CAT) composed of a global body encoder and a local face/hand decoder. The encoder predicts the body parameters and provides a high-quality feature map for the decoder, which performs a feature-level upsample-crop scheme to extract high-resolution part-specific features and adopts keypoint-guided deformable attention to estimate the hand and face parameters precisely. The whole pipeline is simple yet effective, requires no manual post-processing, and naturally avoids implausible predictions. Comprehensive experiments demonstrate the effectiveness of OSX. Lastly, we build a large-scale Upper-Body dataset (UBody) with high-quality 2D and 3D whole-body annotations. It contains persons with partially visible bodies in diverse real-life scenarios to bridge the gap between the basic task and downstream applications.
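To make the feature-level upsample-crop idea concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the OSX implementation: the helper names (upsample_nearest, crop, upsample_crop), the toy 4x4 feature map, and the box coordinates are all hypothetical, and nearest-neighbour repetition stands in for whatever learned upsampling the actual decoder uses. The point is only the order of operations: the shared feature map is upsampled first and the part region is cropped afterwards, so a hand or face that covers a single cell of the body feature map ends up with a larger grid of part-specific features.

```python
from typing import List, Tuple

Grid = List[List[float]]
# (x0, y0, x1, y1), end-exclusive, measured in feature-map cells (assumed convention)
Box = Tuple[int, int, int, int]


def upsample_nearest(feat: Grid, factor: int) -> Grid:
    """Repeat every cell `factor` times along both axes (stand-in for learned upsampling)."""
    if factor < 1:
        raise ValueError("factor must be >= 1")
    out: Grid = []
    for row in feat:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def crop(feat: Grid, box: Box) -> Grid:
    """Cut the rectangle described by `box` out of the grid."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in feat[y0:y1]]


def upsample_crop(feat: Grid, box: Box, factor: int) -> Grid:
    """Upsample the whole map, then crop the box rescaled to the finer grid."""
    x0, y0, x1, y1 = box
    scaled = (x0 * factor, y0 * factor, x1 * factor, y1 * factor)
    return crop(upsample_nearest(feat, factor), scaled)


if __name__ == "__main__":
    # Toy 4x4 "body" feature map; the hypothetical hand covers one cell.
    feat = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    hand_box = (2, 1, 3, 2)
    part = upsample_crop(feat, hand_box, factor=4)
    print(len(part), len(part[0]))  # one coarse cell becomes a 4x4 part grid
```

With nearest-neighbour repetition the enlarged grid carries no new information; in a trained network the upsampling layers would be learned, which is what would let the cropped region hold finer detail than the coarse cell it came from.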
