3D hand pose estimation (HPE) is the process of locating the joints of the hand in 3D from visual input. HPE has recently received increased attention due to its key role in a variety of human-computer interaction applications. Recent HPE methods have demonstrated the advantages of using videos or multi-view images, which allow for more robust HPE systems. Accordingly, in this study, we propose a new method to perform Sequential learning with Transformer for Hand Pose (SeTHPose) estimation. Our SeTHPose pipeline begins by extracting visual embeddings from individual hand images. We then use a transformer encoder to learn the sequential context along time or viewing angles and generate accurate 2D hand joint locations. Next, a graph convolutional neural network with a U-Net configuration converts the 2D hand joint locations to 3D poses. Our experiments show that SeTHPose performs well on both varieties of hand sequences, temporal and angular. SeTHPose also outperforms other methods in the field, achieving new state-of-the-art results on two publicly available sequential datasets, STB and MuViHand.
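To make the sequential step concrete, the following is a minimal sketch of how self-attention can mix per-frame embeddings along a sequence (time or viewing angle) before regressing 2D joints. It is not the SeTHPose implementation: the single attention head, the residual connection, the linear output head, the 21-joint hand model, the random weights, and all function and parameter names (`self_attention`, `predict_2d_joints`, `wq`, `wk`, `wv`, `w_out`) are assumptions made for illustration. The image encoder and the graph U-Net that lifts 2D joints to 3D are not covered.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention over a sequence x (T x D)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)  # T x T: how much each frame attends to each other frame
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights


def predict_2d_joints(frame_embeddings, params, num_joints=21):
    """Fuse context across the sequence, then regress (x, y) per joint per frame.

    num_joints=21 is an assumed hand model, not a value taken from the text.
    """
    context, weights = self_attention(
        frame_embeddings, params["wq"], params["wk"], params["wv"]
    )
    # Residual connection: keep each frame's own embedding plus the sequence context.
    fused = [[a + b for a, b in zip(x, c)] for x, c in zip(frame_embeddings, context)]
    flat = matmul(fused, params["w_out"])  # T x (num_joints * 2)
    joints = [
        [(row[2 * j], row[2 * j + 1]) for j in range(num_joints)] for row in flat
    ]
    return joints, weights


def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    T, D, J = 4, 8, 21  # frames (or views), embedding size, joints: all illustrative
    params = {
        "wq": random_matrix(D, D, rng),
        "wk": random_matrix(D, D, rng),
        "wv": random_matrix(D, D, rng),
        "w_out": random_matrix(D, 2 * J, rng),
    }
    frames = random_matrix(T, D, rng)  # stand-in for per-image visual embeddings
    joints, weights = predict_2d_joints(frames, params, J)
    print(len(joints), len(joints[0]))  # frames, joints per frame
```

Because the attention weights are computed between every pair of sequence positions, the same mechanism applies whether the positions are consecutive video frames or different camera views of one hand.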