We propose a robust and accurate method for estimating the 3D poses of two hands in close interaction from a single color image. This is a very challenging problem, as large occlusions and many confusions between the joints can occur. State-of-the-art methods solve this problem by regressing a heatmap for each joint, which requires solving two problems simultaneously: localizing the joints and recognizing them. In this work, we propose to separate these tasks: a CNN first localizes joints as 2D keypoints, and self-attention between the CNN features at these keypoints associates each keypoint with the corresponding hand joint. The resulting architecture, which we call the "Keypoint Transformer", is highly efficient: it achieves state-of-the-art performance on the InterHand2.6M dataset with roughly half the number of model parameters. We also show it can easily be extended to estimate, with high accuracy, the 3D pose of an object manipulated by one or two hands. Moreover, we created a new dataset of more than 75,000 images of two hands manipulating an object, fully annotated in 3D, and will make it publicly available.
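To make the two-stage idea concrete, here is a minimal sketch of the keypoint-association stage, assuming PyTorch: CNN features sampled at detected 2D keypoints are treated as tokens, self-attention is run between them, and each token is classified into one of 42 joint identities (assuming 21 joints per hand, as in InterHand2.6M). The class name, the feature dimension, and the small positional MLP are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class KeypointAssociation(nn.Module):
    """Sketch of the association stage: a CNN has already localized 2D
    keypoints; self-attention over the CNN features sampled at those
    keypoints assigns each keypoint a hand-joint identity."""

    def __init__(self, feat_dim=256, num_joints=42, num_layers=4, num_heads=8):
        super().__init__()
        # Hypothetical encoding of each keypoint's normalized 2D location.
        self.pos_mlp = nn.Sequential(
            nn.Linear(2, feat_dim), nn.ReLU(), nn.Linear(feat_dim, feat_dim)
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers)
        # One class per joint of each hand (2 x 21 = 42).
        self.classifier = nn.Linear(feat_dim, num_joints)

    def forward(self, keypoint_feats, keypoint_xy):
        # keypoint_feats: (B, K, feat_dim) CNN features at the K detected keypoints
        # keypoint_xy:    (B, K, 2) normalized 2D keypoint locations
        tokens = keypoint_feats + self.pos_mlp(keypoint_xy)
        tokens = self.encoder(tokens)       # self-attention between keypoints
        return self.classifier(tokens)      # (B, K, num_joints) identity logits

if __name__ == "__main__":
    model = KeypointAssociation()
    feats = torch.randn(1, 30, 256)   # features at 30 detected keypoints
    xy = torch.rand(1, 30, 2)         # their normalized image coordinates
    logits = model(feats, xy)         # -> shape (1, 30, 42)
```

Separating localization from identification this way lets the CNN focus on finding keypoint peaks, while the attention layers resolve left/right-hand and joint-identity ambiguities from the global context of all detected keypoints.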