We present a method for reconstructing accurate and temporally consistent 3D hands from monocular video. We observe that detected 2D hand keypoints and the image texture provide important cues about the geometry and texture of the 3D hand, which can reduce or even eliminate the need for 3D hand annotations. We therefore propose ${\rm {S}^{2}HAND}$, a self-supervised 3D hand reconstruction model that jointly estimates pose, shape, texture, and camera viewpoint from a single RGB image, supervised only by easily accessible detected 2D keypoints. To leverage the continuous hand motion information contained in unlabeled video data, we further propose ${\rm {S}^{2}HAND(V)}$, which uses a weight-shared ${\rm {S}^{2}HAND}$ model to process each frame and exploits additional motion, texture, and shape consistency constraints to encourage more accurate hand poses and more consistent shapes and textures. Experiments on benchmark datasets demonstrate that our self-supervised approach achieves hand reconstruction performance comparable to recent fully-supervised methods in the single-frame setting, and notably improves reconstruction accuracy and consistency when trained on video data.
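For intuition, the two self-supervision signals described above can be sketched as simple loss terms. The following is a minimal PyTorch sketch, not the paper's implementation: the function names, the fixed-focal-length pinhole projection, and the mean-shape formulation of the consistency term are all assumptions made for illustration.

```python
import torch

# Hypothetical reprojection loss: project predicted 3D joints with an
# assumed pinhole camera and compare against detected 2D keypoints.
def reprojection_loss(joints_3d: torch.Tensor,
                      keypoints_2d: torch.Tensor,
                      confidence: torch.Tensor,
                      focal: float = 500.0) -> torch.Tensor:
    # joints_3d:    (B, J, 3) predicted joints in camera space
    # keypoints_2d: (B, J, 2) detections from an off-the-shelf 2D detector
    # confidence:   (B, J)    detector confidences, down-weighting noisy joints
    projected = focal * joints_3d[..., :2] / joints_3d[..., 2:3].clamp(min=1e-6)
    residual = (projected - keypoints_2d).norm(dim=-1)            # (B, J)
    return (confidence * residual).sum() / confidence.sum().clamp(min=1e-6)

# Hypothetical per-video shape-consistency term: the same hand's shape
# parameters (e.g., MANO-style betas) should stay nearly constant over time.
def shape_consistency_loss(betas: torch.Tensor) -> torch.Tensor:
    # betas: (T, D) shape parameters predicted for T frames of one video
    return (betas - betas.mean(dim=0, keepdim=True)).pow(2).mean()
```

In this reading, the reprojection term supplies the annotation-free training signal for the single-frame ${\rm {S}^{2}HAND}$ model, while consistency terms of the second kind are what ${\rm {S}^{2}HAND(V)}$ adds when the weight-shared model processes multiple frames of one video.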