We present THUNDR, a transformer-based deep neural network methodology to reconstruct the 3d pose and shape of people from monocular RGB images. Key to our methodology is an intermediate 3d marker representation, through which we aim to combine the predictive power of model-free output architectures with the regularizing, anthropometry-preserving properties of a statistical human surface model like GHUM, a recently introduced, expressive full-body statistical 3d human model trained end-to-end. Our novel transformer-based prediction pipeline can focus on image regions relevant to the task, supports self-supervised training regimes, and ensures that solutions are consistent with human anthropometry. We show state-of-the-art results on Human3.6M and 3DPW, for both fully supervised and self-supervised models, on the task of inferring 3d human shape, joint positions, and global translation. Moreover, we observe very solid 3d reconstruction performance on difficult human poses collected in the wild.
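To make the hybrid "marker" idea concrete, here is a minimal sketch, not the authors' implementation: a model-free predictor outputs 3d surface markers, and a statistical body model is then fit to those markers so the final reconstruction stays anthropometrically plausible. The toy linear (PCA-style) shape space, the marker count, and the names `model_markers`, `fit_betas`, `shape_basis` are all illustrative assumptions; GHUM itself is a nonlinear deep model, and the paper's marker predictor is a transformer rather than the random data used here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 64 surface markers, 10 shape coefficients.
N_MARKERS, N_BETAS = 64, 10
mean_markers = rng.normal(size=(N_MARKERS, 3))          # toy template markers
shape_basis = rng.normal(size=(N_BETAS, N_MARKERS, 3))  # toy linear shape basis

def model_markers(betas):
    """Markers produced by the (toy) statistical body model for given
    shape coefficients: mean + linear combination of basis directions."""
    return mean_markers + np.tensordot(betas, shape_basis, axes=1)

def fit_betas(predicted_markers, reg=1e-1):
    """Ridge least-squares fit of shape coefficients to model-free marker
    predictions. The L2 prior pulls the solution toward the model mean,
    playing the 'regularizing, anthropometry-preserving' role the abstract
    attributes to the statistical body model."""
    A = shape_basis.reshape(N_BETAS, -1).T            # (3*N_MARKERS, N_BETAS)
    b = (predicted_markers - mean_markers).reshape(-1)
    return np.linalg.solve(A.T @ A + reg * np.eye(N_BETAS), A.T @ b)

# Simulate noisy model-free marker predictions and recover the shape.
true_betas = rng.normal(size=N_BETAS)
noisy_markers = model_markers(true_betas) + 0.05 * rng.normal(size=(N_MARKERS, 3))
betas_hat = fit_betas(noisy_markers)
recon = model_markers(betas_hat)
print("marker RMSE:", np.sqrt(np.mean((recon - noisy_markers) ** 2)))
```

The design point this sketch isolates: free-form marker predictions can drift off the space of plausible bodies, while projecting them through a statistical model's (here linear, in GHUM's case learned and nonlinear) shape space constrains the output to human anthropometry at a small cost in per-marker fitting error.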