Hand pose estimation (HPE) can be used for a variety of human-computer interaction applications such as gesture-based control for physical or virtual/augmented reality devices. Recent works have shown that videos or multi-view images carry rich information regarding the hand, allowing for the development of more robust HPE systems. In this paper, we present the Multi-View Video-Based 3D Hand (MuViHand) dataset, consisting of multi-view videos of the hand along with ground-truth 3D pose labels. Our dataset includes more than 402,000 synthetic hand images available in 4,560 videos. The videos have been simultaneously captured from six different angles with complex backgrounds and random levels of dynamic lighting. The data has been captured from 10 distinct animated subjects using 12 cameras in a semi-circle topology where six tracking cameras only focus on the hand and the other six fixed cameras capture the entire body. Next, we implement MuViHandNet, a neural pipeline consisting of image encoders for obtaining visual embeddings of the hand, recurrent learners to learn both temporal and angular sequential information, and graph networks with U-Net architectures to estimate the final 3D pose information. We perform extensive experiments and show the challenging nature of this new dataset as well as the effectiveness of our proposed method. Ablation studies show the added value of each component in MuViHandNet, as well as the benefit of having temporal and sequential information in the dataset.