Top-down methods dominate the field of 3D human pose and shape estimation, because they are decoupled from human detection and allow researchers to focus on the core problem. However, cropping, their first step, discards the location information from the very beginning, which makes them unable to accurately predict the global rotation in the original camera coordinate system. To address this problem, we propose to Carry Location Information in Full Frames (CLIFF) into this task. Specifically, we feed more holistic features to CLIFF by concatenating the cropped-image feature with its bounding box information. We calculate the 2D reprojection loss with a broader view of the full frame, taking a projection process similar to that of the person projected in the image. Fed and supervised by global-location-aware information, CLIFF directly predicts the global rotation along with more accurate articulated poses. In addition, we propose a pseudo-ground-truth annotator based on CLIFF, which provides high-quality 3D annotations for in-the-wild 2D datasets and offers crucial full supervision for regression-based methods. Extensive experiments on popular benchmarks show that CLIFF outperforms prior art by a significant margin and ranks first on the AGORA leaderboard (the SMPL-Algorithms track). The code and data are available at https://github.com/huawei-noah/noah-research/tree/master/CLIFF.
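The two ingredients described above, concatenating bounding-box information with the cropped-image feature and projecting predicted 3D joints in the full frame rather than the crop, can be sketched as follows. This is a minimal illustration under assumed conventions (the exact encoding of the box information and the focal-length heuristic are assumptions, and the helper names `bbox_side_info`, `cliff_style_features`, and `project_full_frame` are hypothetical, not from the released code):

```python
import numpy as np

def bbox_side_info(bbox_center, bbox_size, img_shape, focal_length=None):
    """Encode the crop's location relative to the full frame (hypothetical encoding).

    The offset of the box center from the full-image center and the box size
    are normalized by an estimated focal length, so the network can reason
    about where the person sits in the original camera's field of view.
    """
    h, w = img_shape
    if focal_length is None:
        # common heuristic when intrinsics are unknown: focal length ~ image diagonal
        focal_length = np.sqrt(h ** 2 + w ** 2)
    cx, cy = bbox_center
    dx = cx - w / 2.0  # horizontal offset from the full-image center
    dy = cy - h / 2.0  # vertical offset from the full-image center
    return np.array([dx, dy, bbox_size], dtype=np.float32) / focal_length

def cliff_style_features(crop_feat, bbox_center, bbox_size, img_shape):
    """Concatenate the cropped-image feature with its bounding-box information."""
    side = bbox_side_info(bbox_center, bbox_size, img_shape)
    return np.concatenate([crop_feat, side])

def project_full_frame(joints3d, cam_t, img_shape, focal_length=None):
    """Pinhole-project 3D joints onto the full frame for the 2D reprojection loss."""
    h, w = img_shape
    if focal_length is None:
        focal_length = np.sqrt(h ** 2 + w ** 2)
    pts = joints3d + cam_t            # translate into full-frame camera coordinates
    uv = pts[:, :2] / pts[:, 2:3]     # perspective division by depth
    return focal_length * uv + np.array([w / 2.0, h / 2.0])

# Toy usage: a 2048-d backbone feature plus 3-d box info -> 2051-d regressor input,
# and a joint on the optical axis projecting to the full-image center.
feat = np.zeros(2048, dtype=np.float32)
x = cliff_style_features(feat, bbox_center=(900, 500), bbox_size=256, img_shape=(720, 1280))
uv = project_full_frame(np.zeros((1, 3)), np.array([0.0, 0.0, 5.0]), (720, 1280))
print(x.shape, uv[0])
```

Supervising the 2D loss in full-frame coordinates is what forces the predicted global rotation to be consistent with the original camera, which a crop-space loss cannot do.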