We present a simple and effective method for 3D hand pose estimation from a single depth frame. As opposed to previous state-of-the-art methods based on holistic 3D regression, our method works on dense pixel-wise estimation. This is achieved by careful design choices in pose parameterization, which leverages both 2D and 3D properties of the depth map. Specifically, we decompose the pose parameters into a set of per-pixel estimations, i.e., 2D heat maps, 3D heat maps, and unit 3D directional vector fields. The 2D/3D joint heat maps and 3D joint offsets are estimated via multi-task network cascades trained end-to-end. The pixel-wise estimations can be directly translated into a vote-casting scheme. A variant of mean shift is then used to aggregate local votes while, by design, enforcing consensus between the estimated 3D pose and the pixel-wise 2D and 3D estimations. Our method is efficient and highly accurate. On the MSRA and NYU hand datasets, our method outperforms all previous state-of-the-art approaches by a large margin. On the ICVL hand dataset, our method achieves accuracy similar to the nearly saturated state-of-the-art result and outperforms various other proposed methods. Code is available $\href{https://github.com/melonwan/denseReg}{\text{online}}$.
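The vote-aggregation step above can be illustrated with a minimal sketch. This is not the authors' implementation; it is a generic weighted mean-shift iteration with a Gaussian kernel, where each pixel's 3D vote is weighted by a hypothetical per-pixel confidence (standing in for the heat-map values), and the bandwidth value is an assumed illustrative choice.

```python
import numpy as np

def mean_shift_3d(votes, weights, bandwidth=0.02, iters=10):
    """Aggregate weighted 3D votes into a single joint location via mean shift.

    votes:   (N, 3) candidate 3D positions cast by pixels.
    weights: (N,) per-vote confidences (e.g. heat-map responses); hypothetical here.
    """
    # Initialize at the weighted mean of all votes.
    x = np.average(votes, axis=0, weights=weights)
    for _ in range(iters):
        d2 = np.sum((votes - x) ** 2, axis=1)
        # Gaussian kernel downweights votes far from the current estimate,
        # so isolated outlier votes are effectively ignored.
        k = weights * np.exp(-d2 / (2.0 * bandwidth ** 2))
        if k.sum() < 1e-12:
            break
        x = (k[:, None] * votes).sum(axis=0) / k.sum()
    return x

# Toy usage: 50 votes clustered near (0.1, 0.2, 0.3) plus one outlier vote.
rng = np.random.default_rng(0)
votes = np.vstack([rng.normal([0.1, 0.2, 0.3], 0.005, (50, 3)),
                   [[1.0, 1.0, 1.0]]])
weights = np.ones(len(votes))
joint = mean_shift_3d(votes, weights)  # converges near the dense cluster
```

The Gaussian kernel makes the estimate robust: the single outlier vote receives a vanishing weight after the first iteration, so the recovered joint location tracks the dense cluster of consistent votes.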