While 6D object pose estimation has wide applications across computer vision and robotics, it remains far from solved due to the lack of annotations. The problem becomes even more challenging when moving to category-level 6D pose, which requires generalization to unseen instances. Current approaches are restricted by their reliance on annotations from simulation or from human labeling. In this paper, we overcome this barrier by introducing a self-supervised learning approach for category-level 6D pose estimation in the wild, trained directly on large-scale real-world object videos. Our framework reconstructs the canonical 3D shape of an object category and learns dense correspondences between input images and the canonical shape via surface embedding. For training, we propose novel geometric cycle-consistency losses that construct cycles across 2D-3D space, across different instances, and across different time steps. The learned correspondences can be applied to 6D pose estimation and other downstream tasks such as keypoint transfer. Surprisingly, our method, without any human annotations or simulators, achieves performance on par with or even better than previous supervised or semi-supervised methods on in-the-wild images. Our project page is: https://kywind.github.io/self-pose .
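To make the cycle-consistency idea concrete, the following is a minimal sketch of what a 2D-3D cycle loss could look like; it is not the paper's implementation, and all names (`cycle_consistency_loss` and its arguments) are hypothetical. Each sampled pixel is matched to a point on the canonical surface (here assumed to come from the learned surface embedding), the point is transformed by the predicted pose and reprojected into the image, and the reprojection is penalized for drifting from the starting pixel.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(pixels, surface_points, pose_R, pose_t, K):
    """Hypothetical 2D -> 3D -> 2D cycle.

    pixels:         (N, 2) sampled pixel coordinates
    surface_points: (N, 3) matched points on the canonical shape,
                    assumed to be produced by the learned surface embedding
    pose_R, pose_t: predicted rotation (3, 3) and translation (3,)
    K:              (3, 3) camera intrinsics
    """
    # Map canonical points into the camera frame with the predicted pose.
    cam_points = surface_points @ pose_R.T + pose_t
    # Pinhole projection back onto the image plane.
    proj = cam_points @ K.T
    reproj = proj[:, :2] / proj[:, 2:3].clamp(min=1e-6)
    # The cycle closes when the reprojected pixel returns to its start.
    return F.smooth_l1_loss(reproj, pixels)
```

The cross-instance and cross-time cycles mentioned in the abstract would follow the same pattern, chaining correspondences through another frame or instance before closing the loop back at the starting pixel.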