We propose a robust and accurate method for estimating the 3D poses of two hands in close interaction from a single color image. This is a very challenging problem, as large occlusions and many confusions between the joints can occur. Our method starts by extracting a set of potential 2D locations for the joints of both hands as extrema of a heatmap. We do not require that all locations correctly correspond to a joint, nor that all the joints are detected. We use appearance and spatial encodings of these locations as input to a Transformer, and leverage its attention mechanism to sort out the correct configuration of the joints and output the 3D poses of both hands. Our approach thus combines the recognition power of a Transformer with the accuracy of heatmap-based methods. We also show that it can be extended to estimate the 3D pose of an object manipulated by one or two hands. We evaluate our approach on the recent and challenging InterHand2.6M and HO-3D datasets, where we obtain a 17% improvement over the baseline. Moreover, we introduce the first dataset of action sequences of two hands manipulating an object, fully annotated in 3D, and will make it publicly available.
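To make the candidate-extraction step concrete, the following is a minimal sketch of selecting potential 2D joint locations as local extrema of a heatmap and attaching a spatial encoding to each, forming tokens for a downstream transformer. The threshold, encoding dimension, and sinusoidal scheme are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def heatmap_peaks(heatmap, threshold=0.5):
    """Return (row, col) coordinates of local maxima above `threshold`.

    Candidates need not all be true joints; the downstream transformer
    is expected to sort out the correct configuration.
    """
    H, W = heatmap.shape
    peaks = []
    for r in range(H):
        for c in range(W):
            v = heatmap[r, c]
            if v < threshold:
                continue
            # Compare against the 8-connected neighborhood (clipped at borders).
            window = heatmap[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2]
            if v >= window.max():
                peaks.append((r, c))
    return peaks

def spatial_encoding(r, c, dim=8):
    """Sinusoidal encoding of a 2D location (illustrative choice)."""
    freqs = 2.0 ** np.arange(dim // 4)
    enc = []
    for x in (r, c):
        enc.extend(np.sin(x / freqs))
        enc.extend(np.cos(x / freqs))
    return np.array(enc)

# Toy heatmap with two blobs standing in for joint responses.
hm = np.zeros((8, 8))
hm[2, 3] = 0.9
hm[6, 5] = 0.8
candidates = heatmap_peaks(hm)
tokens = np.stack([spatial_encoding(r, c) for r, c in candidates])
```

In the actual method, each token would also carry an appearance encoding sampled from image features at the candidate location; here only the spatial part is sketched.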