Analyzing and understanding hand information from multimedia materials such as images or videos is important for many real-world applications and remains an active topic in the research community. Various works focus on recovering hand information from a single image; however, they usually address only a single task, for example, hand mask segmentation, 2D/3D hand pose estimation, or hand mesh reconstruction, and do not perform well in challenging scenarios. To further improve performance on these tasks, we propose a novel Hand Image Understanding (HIU) framework that extracts comprehensive information about the hand object from a single RGB image by jointly considering the relationships between these tasks. To achieve this goal, we design a cascaded multi-task learning (MTL) backbone that estimates the 2D heat maps, learns the segmentation mask, and generates the intermediate 3D information encoding, followed by a coarse-to-fine learning paradigm and a self-supervised learning strategy. Qualitative experiments demonstrate that our approach is capable of recovering reasonable mesh representations even in challenging situations. Quantitatively, our method significantly outperforms state-of-the-art approaches on various widely used datasets, in terms of diverse evaluation metrics.
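The cascaded multi-task pipeline described above can be sketched in simplified form. This is a minimal structural illustration, not the paper's actual implementation: the stage count, function names, and data representations below are all assumptions chosen for clarity; a real system would use a CNN backbone and learned prediction heads.

```python
# Hypothetical sketch of a cascaded multi-task learning (MTL) pipeline:
# each stage jointly predicts 2D heatmaps, a segmentation mask, and an
# intermediate 3D encoding, conditioned on the previous stage's outputs
# (coarse-to-fine refinement). All names here are illustrative.

def backbone_features(image):
    # Stand-in for a feature encoder; a real model would run a CNN here.
    return {"features_of": image}

def mtl_stage(features, prev_outputs):
    # One cascade stage: the three task heads share the fused input,
    # which is how the tasks' mutual relationships are exploited.
    fused = {"features": features, "previous": prev_outputs}
    return {
        "heatmaps": ("2d-heatmaps", fused),
        "mask": ("segmentation-mask", fused),
        "encoding_3d": ("3d-encoding", fused),
    }

def cascaded_mtl(image, num_stages=3):
    # Coarse-to-fine: later stages refine earlier estimates; keeping the
    # per-stage outputs allows intermediate supervision during training.
    feats = backbone_features(image)
    outputs, history = None, []
    for _ in range(num_stages):
        outputs = mtl_stage(feats, outputs)
        history.append(outputs)
    return history
```

Each stage's predictions would be supervised during training, with the self-supervised strategy providing additional signal when full annotations are unavailable.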