While deep learning has reshaped the classical motion capture pipeline, generative, analysis-by-synthesis elements are still in use to recover fine details when a high-quality 3D model of the user is available. Unfortunately, obtaining such a model for every user a priori is challenging, time-consuming, and limits the application scenarios. We propose a novel test-time optimization approach for monocular motion capture that learns a volumetric body model of the user in a self-supervised manner. To this end, our approach combines the advantages of neural radiance fields with an articulated skeleton representation. Our proposed skeleton embedding serves as a common reference that links constraints across time, thereby reducing the number of required camera views from the dozens of calibrated cameras used traditionally down to a single uncalibrated one. As a starting point, we employ the output of an off-the-shelf model that predicts the 3D skeleton pose. The volumetric body shape and appearance are then learned from scratch, while the initial pose estimate is jointly refined. Our approach is self-supervised and does not require any additional ground-truth labels for appearance, pose, or 3D shape. We demonstrate that our novel combination of a discriminative pose estimation technique with surface-free analysis-by-synthesis outperforms purely discriminative monocular pose estimation approaches and generalizes well to multiple views.
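To make the described pipeline concrete, the following is a minimal sketch under assumed simplifications, not the authors' implementation: a NeRF-style MLP queried with skeleton-relative coordinates, simplified volume rendering along rays, and a photometric test-time loss that jointly optimizes the radiance field and a per-frame pose correction on top of an off-the-shelf pose initialization. All names (SkeletonRelativeNeRF, render_rays, pose_offset), shapes, and the toy data are hypothetical.

```python
# Hypothetical sketch of self-supervised test-time optimization with a
# skeleton-relative NeRF. Illustrative only; not the paper's code.
import torch
import torch.nn as nn

NUM_JOINTS, SAMPLES_PER_RAY = 24, 64

class SkeletonRelativeNeRF(nn.Module):
    """Maps skeleton-relative point encodings to density and RGB."""
    def __init__(self, hidden=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 * NUM_JOINTS, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # (density, r, g, b)
        )

    def forward(self, points, joints):
        # points: (N, S, 3) world-space samples, joints: (J, 3) posed joints.
        # Expressing each sample relative to every joint ties observations of
        # the same body part together across frames (the skeleton embedding).
        rel = points[:, :, None, :] - joints[None, None, :, :]  # (N, S, J, 3)
        out = self.mlp(rel.flatten(-2))                         # (N, S, 4)
        density = torch.relu(out[..., 0])
        rgb = torch.sigmoid(out[..., 1:])
        return density, rgb

def render_rays(model, origins, dirs, joints, near=0.5, far=3.0):
    """Simplified volume rendering along each ray (no hierarchical sampling)."""
    t = torch.linspace(near, far, SAMPLES_PER_RAY)              # (S,)
    points = origins[:, None, :] + t[None, :, None] * dirs[:, None, :]
    density, rgb = model(points, joints)
    delta = (far - near) / SAMPLES_PER_RAY
    alpha = 1.0 - torch.exp(-density * delta)                   # (N, S)
    trans = torch.cumprod(torch.cat(
        [torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1), dim=1)[:, :-1]
    weights = alpha * trans
    return (weights[..., None] * rgb).sum(dim=1)                # (N, 3)

# Toy stand-ins for one monocular frame: ray origins/directions, observed
# pixel colors, and an initial joint estimate from a discriminative pose net.
rays_o = torch.zeros(256, 3)
rays_d = nn.functional.normalize(torch.randn(256, 3), dim=-1)
pixels = torch.rand(256, 3)
initial_joints = torch.randn(NUM_JOINTS, 3)

model = SkeletonRelativeNeRF()
pose_offset = nn.Parameter(torch.zeros(NUM_JOINTS, 3))          # refined jointly
optim = torch.optim.Adam(list(model.parameters()) + [pose_offset], lr=1e-3)

for step in range(100):                                         # test-time loop
    joints = initial_joints + pose_offset                       # refined pose
    pred = render_rays(model, rays_o, rays_d, joints)
    loss = ((pred - pixels) ** 2).mean()                        # photometric only
    optim.zero_grad()
    loss.backward()
    optim.step()
```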