Estimating 3D hand and object pose from a single image is an extremely challenging problem: hands and objects are often self-occluded during interactions, and 3D annotations are scarce, since even humans cannot perfectly label ground truth from a single image. To tackle these challenges, we propose a unified framework for estimating 3D hand and object poses with semi-supervised learning. We build a joint learning framework in which a Transformer performs explicit contextual reasoning between hand and object representations. Going beyond the limited 3D annotations in a single image, we leverage the spatial-temporal consistency in large-scale hand-object videos as a constraint for generating pseudo labels in semi-supervised learning. Our method not only improves hand pose estimation on a challenging real-world dataset, but also substantially improves object pose estimation, which has fewer ground-truth labels per instance. By training on large-scale, diverse videos, our model also generalizes better across multiple out-of-domain datasets. Project page and code: https://stevenlsw.github.io/Semi-Hand-Object
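
To make the contextual-reasoning idea concrete, here is a minimal sketch, assuming PyTorch, of cross-attention in which hand tokens attend to object tokens. The module name, dimensions, and single-layer design are illustrative assumptions for exposition, not the paper's exact architecture.

```python
# Illustrative sketch of hand-object cross-attention (assumed PyTorch).
# Names, dimensions, and the single-layer design are assumptions,
# not the paper's exact module.
import torch
import torch.nn as nn

class HandObjectAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Queries come from hand tokens; keys/values from object tokens,
        # so each hand token aggregates context from the interacting object.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 2 * dim), nn.ReLU(), nn.Linear(2 * dim, dim)
        )

    def forward(self, hand_tokens, obj_tokens):
        # hand_tokens: (B, N_h, dim); obj_tokens: (B, N_o, dim)
        ctx, _ = self.attn(hand_tokens, obj_tokens, obj_tokens)
        x = self.norm1(hand_tokens + ctx)    # residual + norm
        return self.norm2(x + self.ffn(x))   # feed-forward refinement

# Example: 21 hand-joint tokens attend to 8 object-corner tokens.
fused = HandObjectAttention()(torch.randn(2, 21, 256), torch.randn(2, 8, 256))
```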
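The pseudo-labeling step can likewise be sketched: on unlabeled video, per-frame predictions are kept as pseudo labels only when they are temporally consistent with their neighbors. The function name, threshold, and smoothness test below are illustrative assumptions, not the paper's exact criteria.

```python
# Illustrative sketch of pseudo-label selection via temporal consistency.
# The threshold and smoothness test are assumptions for exposition.
import torch

def select_pseudo_labels(joints_seq: torch.Tensor, max_jitter: float = 0.01):
    """joints_seq: (T, 21, 3) per-frame 3D hand-joint predictions (meters).
    Returns a boolean mask (T,) marking frames whose prediction agrees
    with both temporal neighbors."""
    T = joints_seq.shape[0]
    keep = torch.zeros(T, dtype=torch.bool)
    for t in range(1, T - 1):
        # Mean per-joint displacement to the previous and next frames.
        d_prev = (joints_seq[t] - joints_seq[t - 1]).norm(dim=-1).mean()
        d_next = (joints_seq[t] - joints_seq[t + 1]).norm(dim=-1).mean()
        keep[t] = (d_prev < max_jitter) and (d_next < max_jitter)
    return keep

# Frames passing the check supply pseudo ground truth for retraining.
preds = torch.randn(100, 21, 3) * 0.005  # fake, roughly smooth sequence
mask = select_pseudo_labels(preds)
```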