Estimating the 3D pose of a hand from a 2D image is a well-studied problem and a requirement for several real-life applications such as virtual reality, augmented reality, and hand gesture recognition. Currently, reasonable estimates can be computed from single RGB images, especially when a multi-task learning approach forces the system to consider the shape of the hand when determining its pose. However, depending on the method used to represent the hand, performance can drop considerably in real-life tasks, suggesting that stable descriptions are required to achieve satisfactory results. In this paper, we present a keypoint-based end-to-end framework for 3D hand pose and shape estimation and successfully apply it to hand gesture recognition as a case study. Specifically, after a pre-processing step in which the images are normalized, the proposed pipeline uses a multi-task semantic feature extractor that generates 2D heatmaps and hand silhouettes from RGB images, a viewpoint encoder that predicts the hand and camera view parameters, a stable hand estimator that produces the 3D hand pose and shape, and a loss function that guides all of the components jointly during the learning phase. The proposed framework was assessed on a 3D pose and shape estimation benchmark dataset, where it obtained state-of-the-art performance. Our system was also evaluated on two hand gesture recognition benchmark datasets and significantly outperformed other keypoint-based approaches, indicating that it is an effective solution able to generate stable 3D estimates of hand pose and shape.