Despite recent efforts to provide accurate 3D annotations in hand and object datasets, gaps remain in 3D hand and object reconstruction. Existing works leverage contact maps to refine inaccurate hand-object pose estimates and to generate grasps given object models. However, they require explicit 3D supervision, which is seldom available, and are therefore limited to constrained settings, e.g., where thermal cameras observe residual heat left on manipulated objects. In this paper, we propose a novel semi-supervised framework that allows us to learn contact from monocular images. Specifically, we leverage visual and geometric consistency constraints in large-scale datasets to generate pseudo-labels for semi-supervised learning, and we propose an efficient graph-based network to infer contact. Our semi-supervised learning framework achieves a favourable improvement over existing supervised learning methods trained on data with 'limited' annotations. Notably, our proposed model achieves superior results with less than half the network parameters and memory access cost of the commonly used PointNet-based approach. We show that using a contact map that governs hand-object interactions produces more accurate reconstructions. We further demonstrate that training with pseudo-labels can extend contact map estimation to out-of-domain objects and generalise better across multiple datasets.