We present Mask2Hand, a self-trainable method that learns to solve the challenging task of predicting 3D hand pose and shape from a 2D binary mask of the hand silhouette (or shadow), without additional manually annotated data. Given the intrinsic camera parameters and a parametric hand model in camera space, we use differentiable rendering to project the 3D estimates onto the 2D binary silhouette space. By applying a tailored combination of losses between the rendered silhouette and the input binary mask, we integrate a self-guidance mechanism into our end-to-end optimization process to constrain global mesh registration and hand pose estimation. Experiments show that our method, which takes a single binary mask as input, achieves prediction accuracy comparable to that of state-of-the-art methods that require RGB or depth inputs, in both unaligned and aligned settings.
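
The projection-and-compare idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the Gaussian-blob "renderer" (a simple stand-in for a differentiable mesh rasterizer), and the soft IoU loss are all assumptions chosen for illustration, and the abstract does not specify which losses the method actually combines.

```python
import numpy as np

def project_points(points_cam, K):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixel coordinates.

    K is the 3x3 intrinsic matrix; points are assumed to have positive depth.
    """
    uvw = points_cam @ K.T              # each row is K @ [X, Y, Z]
    return uvw[:, :2] / uvw[:, 2:3]     # perspective divide by depth

def soft_silhouette(uv, height, width, sigma=1.5):
    """Soft occupancy image in [0, 1] built from projected points.

    Hypothetical stand-in for a differentiable mesh renderer: every point
    contributes a Gaussian blob, and blobs are merged by a probabilistic union.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    # Squared pixel-to-point distances, shape (N, H, W).
    d2 = (xs[None] - uv[:, 0, None, None]) ** 2 + (ys[None] - uv[:, 1, None, None]) ** 2
    blobs = np.exp(-d2 / (2.0 * sigma ** 2))
    return 1.0 - np.prod(1.0 - blobs, axis=0)

def soft_iou_loss(pred, mask, eps=1e-8):
    """1 - soft IoU between a soft silhouette and a binary mask (one possible silhouette loss)."""
    inter = np.sum(pred * mask)
    union = np.sum(pred + mask - pred * mask)
    return 1.0 - inter / (union + eps)

if __name__ == "__main__":
    # Illustrative intrinsics and two camera-space points (made-up values).
    K = np.array([[100.0, 0.0, 16.0],
                  [0.0, 100.0, 16.0],
                  [0.0, 0.0, 1.0]])
    points = np.array([[0.0, 0.0, 1.0],
                       [0.05, -0.02, 0.5]])
    uv = project_points(points, K)
    rendered = soft_silhouette(uv, 32, 32)
    target = (rendered > 0.5).astype(float)   # pretend this is the input mask
    print("silhouette loss:", soft_iou_loss(rendered, target))
```

In an actual training loop the same computation would be written in an autodiff framework so that the loss gradient flows back through the renderer to the predicted hand model parameters; the NumPy version only shows the forward pass.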