3D hand-object pose estimation is the key to the success of many computer vision applications. The main focus of this task is to effectively model the interaction between the hand and an object. To this end, existing works either rely on interaction constraints in a computationally expensive iterative optimization, or consider only a sparse correlation between sampled hand and object keypoints. In contrast, we propose a novel dense mutual attention mechanism that is able to model fine-grained dependencies between the hand and the object. Specifically, we first construct the hand and object graphs according to their mesh structures. For each hand node, we aggregate features from every object node via learned attention weights, and vice versa for each object node. Thanks to such dense mutual attention, our method is able to produce high-quality, physically plausible poses at real-time inference speed. Extensive quantitative and qualitative experiments on large benchmark datasets show that our method outperforms state-of-the-art methods. The code is available at https://github.com/rongakowang/DenseMutualAttention.git.
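To make the dense mutual attention idea concrete, below is a minimal PyTorch sketch of bidirectional cross-attention between hand and object graph node features: every hand node attends to every object node and vice versa. This is an illustrative assumption of the mechanism described above, not the paper's exact architecture; all class, variable, and dimension names (`DenseMutualAttention`, `Nh`, `No`, `D`, the per-direction projections) are hypothetical.

```python
import torch
import torch.nn as nn


class DenseMutualAttention(nn.Module):
    """Sketch of dense mutual (cross) attention between hand and object
    graph nodes. Assumed structure, not the authors' implementation."""

    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5
        # Separate query/key/value projections for each attention direction.
        self.q_hand, self.k_obj, self.v_obj = (nn.Linear(dim, dim) for _ in range(3))
        self.q_obj, self.k_hand, self.v_hand = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, hand: torch.Tensor, obj: torch.Tensor):
        # hand: (B, Nh, D) node features from the hand mesh graph
        # obj:  (B, No, D) node features from the object mesh graph

        # Each hand node aggregates features from every object node.
        attn_h = torch.softmax(
            self.q_hand(hand) @ self.k_obj(obj).transpose(-2, -1) * self.scale,
            dim=-1,
        )
        hand_out = hand + attn_h @ self.v_obj(obj)

        # And vice versa: each object node aggregates from every hand node.
        attn_o = torch.softmax(
            self.q_obj(obj) @ self.k_hand(hand).transpose(-2, -1) * self.scale,
            dim=-1,
        )
        obj_out = obj + attn_o @ self.v_hand(hand)
        return hand_out, obj_out


# Usage with illustrative sizes (778 is the MANO hand mesh vertex count;
# the object node count and feature width are arbitrary here).
hand = torch.randn(2, 778, 64)
obj = torch.randn(2, 1000, 64)
hand_out, obj_out = DenseMutualAttention(64)(hand, obj)
```

Because the attention is dense rather than restricted to sampled keypoints, each node's updated feature can reflect its relation to the entire opposing mesh in a single forward pass, which is what avoids the iterative optimization mentioned above.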