3D interacting hand reconstruction is essential for human-machine interaction and human behavior understanding. Previous works in this field either rely on auxiliary inputs such as depth images, or can only handle a single hand when monocular RGB images are used. Single-hand methods tend to generate collided hand meshes when applied to closely interacting hands, since they cannot explicitly model the interactions between two hands. In this paper, we make the first attempt to reconstruct 3D interacting hands from monocular single RGB images. Our method can generate 3D hand meshes with both precise 3D poses and minimal collisions. This is made possible via a two-stage framework. Specifically, the first stage adopts a convolutional neural network to generate coarse predictions that tolerate collisions but encourage pose-accurate hand meshes. The second stage progressively ameliorates the collisions through a series of factorized refinements while retaining the precision of the 3D poses. We carefully investigate potential implementations of the factorized refinement, considering the trade-off between efficiency and accuracy. Extensive quantitative and qualitative results on large-scale datasets, such as InterHand2.6M, demonstrate the effectiveness of the proposed approach.