Real-time tracking of 3D hand pose in world space is a challenging problem and plays an important role in VR interaction. Existing work in this space is limited to either producing root-relative (versus world-space) 3D pose or relying on multiple stages, such as generating heatmaps followed by kinematic optimization, to obtain 3D pose. Moreover, the typical VR scenario, which involves multi-view tracking from wide \ac{fov} cameras, is seldom addressed by these methods. In this paper, we present a unified end-to-end differentiable framework for multi-view, multi-frame hand tracking that directly predicts 3D hand pose in world space. We demonstrate the benefits of end-to-end differentiability by extending our framework with downstream tasks such as jitter reduction and pinch prediction. To demonstrate the efficacy of our model, we further present a new large-scale egocentric hand pose dataset consisting of both real and synthetic data. Experiments show that our system trained on this dataset handles various challenging interactive motions and has been successfully applied to real-time VR applications.