Reconstructing interacting hands from a single RGB image is a very challenging task. On the one hand, severe mutual occlusion and the similar local appearance of the two hands confuse the extraction of visual features, resulting in misalignment between the estimated hand meshes and the image. On the other hand, the complex interaction patterns between the hands greatly enlarge the solution space of hand poses and increase the difficulty of network learning. In this paper, we propose a decoupled iterative refinement framework to achieve pixel-aligned hand reconstruction while efficiently modeling the spatial relationship between the hands. Specifically, we define two feature spaces with different characteristics, namely a 2D visual feature space and a 3D joint feature space. First, we obtain joint-wise features from the visual feature map and use a graph convolution network and a transformer to perform intra- and inter-hand information interaction, respectively, in the 3D joint feature space. Then, we project the joint features, now carrying global information, back into the 2D visual feature space in an obfuscation-free manner and apply 2D convolutions for pixel-wise enhancement. By alternating these enhancements between the two feature spaces, our method achieves accurate and robust reconstruction of interacting hands. It outperforms all existing two-hand reconstruction methods by a large margin on the InterHand2.6M dataset and shows strong generalization to in-the-wild images.
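To make the alternating refinement concrete, below is a minimal PyTorch sketch of one possible realization of the loop described above. All names (`JointRefiner`, `DecoupledRefinement`, `sample_joint_features`, `splat_joint_features`), tensor shapes, the step count, and the nearest-pixel additive splat standing in for the paper's obfuscation-free back-projection are our assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch of the alternating 2D/3D refinement loop. Shapes, module
# names, and the splatting-based back-projection are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

J = 42   # 21 joints per hand, two hands (assumed)
C = 64   # feature channels (assumed)

class JointRefiner(nn.Module):
    """Intra-hand message passing (graph conv) followed by inter-hand
    attention (transformer) in the 3D joint feature space."""
    def __init__(self, adj):
        super().__init__()
        self.register_buffer("adj", adj)    # (J, J) skeleton adjacency
        self.gcn = nn.Linear(C, C)
        layer = nn.TransformerEncoderLayer(d_model=C, nhead=4, batch_first=True)
        self.attn = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, f):                   # f: (B, J, C)
        f = F.relu(self.gcn(self.adj @ f))  # intra-hand graph convolution
        return self.attn(f)                 # inter-hand attention over all joints

def sample_joint_features(fmap, uv):
    """Gather joint-wise features from the visual feature map at the current
    projected joint locations; fmap: (B, C, H, W), uv: (B, J, 2) in [-1, 1]."""
    feat = F.grid_sample(fmap, uv.unsqueeze(2), align_corners=False)  # (B, C, J, 1)
    return feat.squeeze(-1).transpose(1, 2)                           # (B, J, C)

def splat_joint_features(fmap, joint_feat, uv):
    """Write each joint's globally informed feature back to its pixel cell:
    a crude additive stand-in for the paper's obfuscation-free projection."""
    B, _, H, W = fmap.shape
    out = fmap.clone()
    x = ((uv[..., 0] + 1) * 0.5 * (W - 1)).long().clamp(0, W - 1)
    y = ((uv[..., 1] + 1) * 0.5 * (H - 1)).long().clamp(0, H - 1)
    for b in range(B):
        for j in range(joint_feat.shape[1]):
            out[b, :, y[b, j], x[b, j]] += joint_feat[b, j]
    return out

class DecoupledRefinement(nn.Module):
    def __init__(self, adj, steps=3):
        super().__init__()
        self.joint_refiners = nn.ModuleList([JointRefiner(adj) for _ in range(steps)])
        self.pixel_refiners = nn.ModuleList(
            [nn.Conv2d(C, C, 3, padding=1) for _ in range(steps)])
        self.to_uv = nn.Linear(C, 2)        # re-estimate 2D joint locations

    def forward(self, fmap, uv):
        for joint_ref, conv in zip(self.joint_refiners, self.pixel_refiners):
            jf = joint_ref(sample_joint_features(fmap, uv))           # 2D -> 3D
            fmap = F.relu(conv(splat_joint_features(fmap, jf, uv)))   # 3D -> 2D
            uv = torch.tanh(self.to_uv(jf))                           # refined joints
        return fmap, uv

# Shape check with dummy inputs: a placeholder adjacency and random features.
adj = torch.eye(J)                          # stand-in for the two-hand skeleton graph
model = DecoupledRefinement(adj)
fmap, uv = model(torch.randn(2, C, 64, 64), torch.rand(2, J, 2) * 2 - 1)
```

The sketch only pins down the control flow of alternating between the two feature spaces; the backbone that produces the feature map, the actual learned projection, and the final mesh regression heads are beyond this illustration.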