3D interacting hand pose estimation from a single RGB image is a challenging task, due to severe self-occlusion and inter-occlusion between hands, confusingly similar appearance patterns of the two hands, the ill-posed mapping of joint positions from 2D to 3D, etc. To address these issues, we propose to extend A2J, the state-of-the-art depth-based 3D single-hand pose estimation method, to the RGB domain under the interacting-hand condition. Our key idea is to equip A2J with strong local-global awareness, so that interacting hands' local fine details and global articulation clues among joints are captured jointly. To this end, A2J is evolved within the Transformer's non-local encoding-decoding framework to build A2J-Transformer. It holds three main advantages over A2J. First, self-attention across local anchor points makes them aware of the global spatial context, which better captures joints' articulation clues to resist occlusion. Second, each anchor point is treated as a learnable query with adaptive feature learning, instead of sharing the same local representation with the others, which improves pattern-fitting capacity. Last but not least, anchor points are located in 3D space instead of 2D as in A2J, which facilitates 3D pose prediction. Experiments on the challenging InterHand2.6M dataset demonstrate that A2J-Transformer achieves state-of-the-art model-free performance (a 3.38 mm MPJPE improvement in the two-hand case) and can also be applied to the depth domain with strong generalization.
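To make the anchor-as-query idea concrete, below is a minimal PyTorch sketch of how such a head could look. It is not the authors' implementation: all module names, dimensions (256 anchors, 42 joints for two hands), and the weighted-voting aggregation details are assumptions based on the abstract's description of A2J-style offset regression combined with a Transformer decoder over 3D anchor queries.

```python
# Minimal sketch (not the authors' code) of the anchor-as-query idea:
# each 3D anchor point is a learnable query of a Transformer decoder,
# attends to global image features, and regresses per-joint 3D offsets plus
# per-joint weights; final joints are the weighted sum of anchor "votes",
# in the spirit of A2J-style aggregation. All names are hypothetical.

import torch
import torch.nn as nn


class AnchorQueryHead(nn.Module):
    def __init__(self, num_anchors=256, num_joints=42, d_model=256,
                 nhead=8, num_layers=3):
        super().__init__()
        # Learnable content embedding per anchor (each anchor is its own query).
        self.anchor_queries = nn.Embedding(num_anchors, d_model)
        # Learnable 3D anchor positions (x, y, z) used as a positional prior.
        self.anchor_xyz = nn.Parameter(torch.rand(num_anchors, 3))
        self.pos_proj = nn.Linear(3, d_model)

        # Transformer decoder: self-attention across anchor queries gives them
        # global articulation context; cross-attention reads the image tokens.
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

        # Per-anchor heads: 3D offsets to every joint and a per-joint weight logit.
        self.offset_head = nn.Linear(d_model, num_joints * 3)
        self.weight_head = nn.Linear(d_model, num_joints)
        self.num_joints = num_joints

    def forward(self, img_tokens):
        """img_tokens: (B, N, d_model) flattened backbone feature tokens."""
        B = img_tokens.size(0)
        queries = self.anchor_queries.weight + self.pos_proj(self.anchor_xyz)
        queries = queries.unsqueeze(0).expand(B, -1, -1)          # (B, A, d)

        feats = self.decoder(tgt=queries, memory=img_tokens)      # (B, A, d)

        offsets = self.offset_head(feats).view(B, -1, self.num_joints, 3)
        weights = self.weight_head(feats).softmax(dim=1)          # over anchors

        # Each anchor votes for every joint: anchor position + predicted offset;
        # joints are the weighted average of all anchor votes.
        votes = self.anchor_xyz.view(1, -1, 1, 3) + offsets       # (B, A, J, 3)
        joints_3d = (weights.unsqueeze(-1) * votes).sum(dim=1)    # (B, J, 3)
        return joints_3d


if __name__ == "__main__":
    head = AnchorQueryHead()
    tokens = torch.randn(2, 14 * 14, 256)    # dummy backbone feature tokens
    print(head(tokens).shape)                # torch.Size([2, 42, 3])
```

Keeping a distinct learnable query per anchor, rather than a shared local representation, is what lets each anchor specialize, while the self-attention step supplies the global joint-articulation context described above.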