Hand pose estimation is difficult due to varying environmental conditions, object- and self-occlusion, and the diversity of hand shape and appearance. Exhaustively covering this wide range of factors in fully annotated datasets has remained impractical, posing significant challenges to the generalization of supervised methods. Embracing this challenge, we propose to combine ideas from adversarial training and motion modelling to tap into unlabeled videos. To this end, we propose what is, to the best of our knowledge, the first motion model for hands and show that an adversarial formulation leads to better generalization properties of the hand pose estimator via semi-supervised training on unlabeled video sequences. In this setting, the pose predictor must produce a valid sequence of hand poses, as determined by a discriminative adversary. This adversary reasons over both the structural and the temporal domain, effectively exploiting the spatio-temporal structure of the task. The main advantage of our approach is that it can make use of unpaired videos and joint-sequence data, both of which are much easier to obtain than paired training data. We perform an extensive evaluation, investigating the essential components of the proposed framework, and empirically demonstrate in two challenging settings that the proposed approach leads to significant improvements in pose estimation accuracy. In the lowest-label setting, we attain an improvement of $40\%$ in absolute mean joint error.
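To make the semi-supervised adversarial setup described above concrete, the following is a minimal sketch, assuming a PyTorch implementation (the abstract does not specify one). `PoseEstimator`, `SeqDiscriminator`, the 21-joint hand layout, and the weight `lambda_adv` are hypothetical placeholders, not the paper's actual architecture: the estimator is trained with a supervised loss on labeled frames plus an adversarial loss that pushes its predicted pose sequences to fool a spatio-temporal discriminator trained on unpaired real joint sequences.

```python
# Hypothetical sketch of semi-supervised adversarial training on pose sequences.
# All module definitions and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

NUM_JOINTS = 21  # assumed hand skeleton size


class PoseEstimator(nn.Module):
    """Maps a video clip (B, T, C, H, W) to per-frame 3D joints (B, T, J, 3)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, NUM_JOINTS * 3),
        )

    def forward(self, clips):
        b, t = clips.shape[:2]
        feats = self.backbone(clips.flatten(0, 1))        # (B*T, J*3)
        return feats.view(b, t, NUM_JOINTS, 3)


class SeqDiscriminator(nn.Module):
    """Scores whether a pose sequence (B, T, J, 3) is structurally and
    temporally plausible; temporal convolutions reason over the sequence."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(NUM_JOINTS * 3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 1, kernel_size=3, padding=1),
        )

    def forward(self, poses):
        b, t = poses.shape[:2]
        x = poses.view(b, t, -1).transpose(1, 2)          # (B, J*3, T)
        return self.net(x).mean(dim=(1, 2))               # one logit per sequence


def training_step(estimator, disc, opt_g, opt_d,
                  labeled_clips, labeled_joints,
                  unlabeled_clips, real_pose_seqs, lambda_adv=0.1):
    bce = nn.BCEWithLogitsLoss()

    # Discriminator: unpaired real joint sequences vs. predicted sequences.
    with torch.no_grad():
        fake_seqs = estimator(unlabeled_clips)
    d_real, d_fake = disc(real_pose_seqs), disc(fake_seqs)
    d_loss = (bce(d_real, torch.ones_like(d_real))
              + bce(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Estimator: supervised loss on labeled frames + adversarial loss on video.
    sup_loss = nn.functional.mse_loss(estimator(labeled_clips), labeled_joints)
    g_fake = disc(estimator(unlabeled_clips))
    adv_loss = bce(g_fake, torch.ones_like(g_fake))       # try to fool the adversary
    g_loss = sup_loss + lambda_adv * adv_loss
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```

Note the design choice this sketch illustrates: the discriminator never needs images paired with poses. It only compares predicted pose sequences against unpaired real joint sequences, which is what lets the approach exploit unlabeled videos and motion-capture-style data that are far cheaper to collect than fully annotated frames.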