3D Hand Pose and Shape Estimation from RGB Images for Improved Keypoint-Based Hand-Gesture Recognition

Estimating the 3D hand pose from a 2D image is a well-studied problem and a requirement for several real-life applications such as virtual reality, augmented reality, and hand-gesture recognition. Currently, good estimates can be computed from single RGB images, especially when a multi-task learning approach forces the system to also consider the hand shape while the pose is determined. However, when addressing the aforementioned real-life tasks, performance can drop considerably depending on the hand representation, which suggests that stable descriptions are required to achieve satisfactory results. Consequently, in this paper we present a keypoint-based end-to-end framework for 3D hand pose and shape estimation, and successfully apply it to hand-gesture recognition as a case study. Specifically, after a pre-processing step in which the images are normalized, the proposed pipeline comprises a multi-task semantic feature extractor generating 2D heatmaps and hand silhouettes from RGB images; a viewpoint encoder predicting hand and camera view parameters; a stable hand estimator producing the 3D hand pose and shape; and a loss function designed to jointly guide all of the components during the learning phase. To assess the proposed framework, tests were performed on a 3D pose and shape estimation benchmark dataset, obtaining state-of-the-art performance. Moreover, the devised system was also evaluated on two hand-gesture recognition benchmark datasets, where the framework significantly outperforms other keypoint-based approaches, indicating that the presented method is an effective solution able to generate stable 3D estimates for the hand pose and shape.
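
To make the idea of a loss that jointly guides several components more concrete, the following is a minimal sketch of one common way to combine per-task errors: a weighted sum of mean squared errors. It is an illustration only, not the loss used in this paper. The task names (`heatmap`, `silhouette`, `pose3d`), the weights, and the helper functions `mse` and `joint_loss` are all hypothetical and chosen for this example.

```python
from typing import Dict, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length, non-empty flat sequences."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("pred and target must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(
    preds: Dict[str, Sequence[float]],
    targets: Dict[str, Sequence[float]],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of per-task MSE terms (hypothetical formulation).

    Each key in `weights` names one task output; every task contributes
    weights[name] * mse(preds[name], targets[name]) to the total.
    """
    return sum(w * mse(preds[name], targets[name]) for name, w in weights.items())


# Toy, hand-made values standing in for flattened network outputs (assumed).
preds = {
    "heatmap": [0.0, 1.0],
    "silhouette": [1.0, 1.0],
    "pose3d": [0.0, 0.0, 2.0],
}
targets = {
    "heatmap": [0.0, 0.0],
    "silhouette": [1.0, 0.0],
    "pose3d": [0.0, 0.0, 0.0],
}
# Illustrative task weights; not taken from the paper.
weights = {"heatmap": 1.0, "silhouette": 0.5, "pose3d": 2.0}

total = joint_loss(preds, targets, weights)
print(f"joint loss: {total:.4f}")
```

Because the total is a single scalar, its gradient would reach every component that contributes a term, which is the sense in which such a loss can guide all parts of a pipeline at once.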