This work proposes an end-to-end approach to estimate the full 3D hand pose from stereo cameras. Most existing methods for estimating hand pose from stereo cameras apply stereo matching to obtain a depth map and then use a depth-based solution to estimate the hand pose. In contrast, we propose to bypass stereo matching and directly estimate the 3D hand pose from the stereo image pair. The proposed neural network architecture extends any keypoint predictor to estimate the sparse disparities of the hand joints. To train the model effectively, we introduce a large-scale synthetic dataset composed of stereo image pairs with ground-truth 3D hand pose annotations. Experiments show that the proposed approach outperforms existing methods that rely on stereo depth.
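For context, a minimal sketch of how per-joint sparse disparities can be back-projected to 3D joint positions using the standard stereo relation Z = f·B/d; the function name, camera parameters, and joint count below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def joints_to_3d(keypoints_px, disparities_px, fx, fy, cx, cy, baseline_m):
    """Back-project 2D joint keypoints and sparse disparities to 3D.

    keypoints_px   : (J, 2) array of (u, v) pixel coordinates in the left image.
    disparities_px : (J,) array of horizontal disparities d = u_left - u_right.
    fx, fy, cx, cy : left-camera intrinsics (pixels); illustrative values only.
    baseline_m     : stereo baseline in metres.
    Returns a (J, 3) array of joint positions in the left-camera frame.
    """
    u, v = keypoints_px[:, 0], keypoints_px[:, 1]
    d = np.maximum(disparities_px, 1e-6)   # guard against zero disparity
    z = fx * baseline_m / d                # standard stereo relation Z = f*B/d
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)

# Example with 21 hand joints (assumed joint count) and dummy predictions.
joints_2d = np.random.rand(21, 2) * [640, 480]
disp = np.random.rand(21) * 40 + 10
xyz = joints_to_3d(joints_2d, disp, fx=600.0, fy=600.0, cx=320.0, cy=240.0, baseline_m=0.06)
print(xyz.shape)  # (21, 3)
```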