Acquiring accurate 3D annotated data for hand pose estimation is a notoriously difficult problem. Acquisition typically requires complex multi-camera setups and controlled conditions, which in turn creates a domain gap that is hard to bridge to fully unconstrained settings. Encouraged by the success of contrastive learning on image classification tasks, we propose a new self-supervised method for the structured regression task of 3D hand pose estimation. Contrastive learning uses unlabeled data for representation learning via a loss formulation that encourages the learned feature representations to be invariant under any image transformation. For 3D hand pose estimation, invariance to appearance transformations such as color jitter is likewise desirable. However, the task requires equivariance under affine transformations such as rotation and translation. To address this issue, we propose an equivariant contrastive objective and demonstrate its effectiveness in the context of 3D hand pose estimation. We experimentally investigate the impact of invariant and equivariant contrastive objectives and show that learning equivariant features leads to better representations for the task of 3D hand pose estimation. Furthermore, we show that a standard ResNet-152, trained on additional unlabeled data, attains an improvement of $7.6\%$ in PA-EPE on FreiHAND and thus achieves state-of-the-art performance without any task-specific, specialized architectures.
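The difference between the invariant and the equivariant objective can be illustrated with a small toy sketch. The abstract does not specify the loss, so the code below rests on two assumptions: an NT-Xent contrastive loss of the kind commonly used in contrastive learning, and equivariance enforced by undoing each view's known rotation in latent space before the views are compared. The helper names (`rotate`, `cosine`, `nt_xent`), the toy latents, and the temperature are hypothetical and chosen only for illustration; the encoder is idealized as perfectly equivariant, so its output simply rotates with the input.

```python
import math

def rotate(points, angle):
    """Rotate a list of 2D points about the origin by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]

def cosine(p, q):
    """Cosine similarity between two latents, each a list of 2D points."""
    u = [v for pt in p for v in pt]
    w = [v for pt in q for v in pt]
    dot = sum(a * b for a, b in zip(u, w))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_u * norm_w)

def nt_xent(view_a, view_b, temperature=0.5):
    """NT-Xent loss; view_a[i] and view_b[i] form the positive pair."""
    z = view_a + view_b
    n = len(view_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same sample
        denom = sum(
            math.exp(cosine(z[i], z[k]) / temperature)
            for k in range(2 * n) if k != i
        )
        total += -math.log(math.exp(cosine(z[i], z[pos]) / temperature) / denom)
    return total / (2 * n)

# Toy latents (assumption): two poses that differ only by a 90 degree rotation.
poses = [[(1.0, 0.0), (0.0, 1.0)], [(0.0, 1.0), (-1.0, 0.0)]]
angle_a, angle_b = 0.0, math.pi / 2

# Idealized equivariant encoder: the latent rotates with the augmented image.
view_a = [rotate(p, angle_a) for p in poses]
view_b = [rotate(p, angle_b) for p in poses]

# Invariant objective: compare the latents of the two views directly.
loss_invariant = nt_xent(view_a, view_b)

# Equivariant objective (assumed form): undo each known rotation first.
undone_a = [rotate(z, -angle_a) for z in view_a]
undone_b = [rotate(z, -angle_b) for z in view_b]
loss_equivariant = nt_xent(undone_a, undone_b)

print("invariant loss:  ", loss_invariant)
print("equivariant loss:", loss_equivariant)
```

In this toy setup the rotated view of the first pose coincides with the second pose, so the invariant objective is asked to pull together latents that encode different geometry. Once the rotations are undone, each positive pair matches exactly and the loss is lower.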