Existing methods for hand reconstruction typically either parameterize a generic 3D hand model or predict hand mesh positions directly. Parametric representations, consisting of hand shapes and rotational poses, are more stable, while non-parametric methods can predict more accurate mesh positions. In this paper, we propose to simultaneously reconstruct the meshes and estimate the MANO parameters of two hands from a single RGB image, so as to combine the merits of both kinds of hand representations. To this end, we propose novel Mesh-Mano interaction blocks (MMIBs), which take mesh vertex positions and MANO parameters as two kinds of query tokens. Each MMIB consists of one graph residual block to aggregate local information and two transformer encoders to model long-range dependencies. The transformer encoders are equipped with different asymmetric attention masks to model intra-hand and inter-hand attention, respectively. Moreover, we introduce a mesh alignment refinement module to further enhance mesh-image alignment. Extensive experiments on the InterHand2.6M benchmark demonstrate promising improvements over state-of-the-art hand reconstruction methods.
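To make the MMIB design concrete, the sketch below shows one plausible reading of the block: a graph residual step for local aggregation, followed by two transformer encoders whose attention masks restrict queries to same-hand and other-hand tokens, respectively. All names, dimensions, and the mask construction are illustrative assumptions, not the authors' exact implementation.

```python
# Hypothetical MMIB sketch (token layout, dimensions, and mask rules are assumptions).
import torch
import torch.nn as nn

def build_attention_masks(n_left, n_right):
    """Boolean masks over the concatenated [left-hand; right-hand] token sequence.

    intra-hand mask: queries may only attend to tokens of the same hand.
    inter-hand mask: queries may only attend to tokens of the other hand.
    True marks blocked positions, as expected by nn.MultiheadAttention's attn_mask.
    """
    n = n_left + n_right
    intra = torch.zeros(n, n, dtype=torch.bool)
    intra[:n_left, n_left:] = True   # left queries cannot see right keys
    intra[n_left:, :n_left] = True   # right queries cannot see left keys
    inter = ~intra                   # complementary mask for cross-hand attention
    inter.fill_diagonal_(False)      # let each token still attend to itself
    return intra, inter

class MeshManoInteractionBlock(nn.Module):
    """Sketch of an MMIB: local graph aggregation, then intra- and inter-hand attention."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.graph_res = nn.Linear(dim, dim)  # stand-in for the graph residual block
        self.intra_enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.inter_enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)

    def forward(self, tokens, adj, intra_mask, inter_mask):
        # tokens: (B, N, dim) mesh-vertex and MANO query tokens; adj: (B, N, N) normalized adjacency.
        tokens = tokens + torch.bmm(adj, self.graph_res(tokens))   # aggregate local neighborhoods
        tokens = self.intra_enc(tokens, src_mask=intra_mask)       # long-range, within each hand
        tokens = self.inter_enc(tokens, src_mask=inter_mask)       # long-range, across the two hands
        return tokens
```

The two masks are disjoint by construction, so the first encoder only mixes information within each hand while the second only exchanges information between hands, matching the intra-/inter-hand split described above.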