While 6D object pose estimation has wide applications across computer vision and robotics, it remains far from solved due to the scarcity of annotations. The problem becomes even more challenging when moving to category-level 6D pose, which requires generalization to unseen instances. Current approaches are restricted by their reliance on annotations from simulation or human labeling. In this paper, we overcome this barrier by introducing a self-supervised learning approach trained directly on large-scale real-world object videos for category-level 6D pose estimation in the wild. Our framework reconstructs the canonical 3D shape of an object category and learns dense correspondences between input images and the canonical shape via surface embedding. For training, we propose novel geometrical cycle-consistency losses which construct cycles across 2D-3D spaces, across different instances, and across different time steps. The learned correspondence can be applied to 6D pose estimation and other downstream tasks such as keypoint transfer. Surprisingly, our method, without any human annotations or simulators, achieves on-par or even better performance than previous supervised or semi-supervised methods on in-the-wild images. Our project page is: https://kywind.github.io/self-pose.
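To make the 2D-3D cycle idea concrete, the following is a minimal sketch (not the paper's implementation) of one such cycle: per-pixel canonical-space coordinates predicted by a network are transformed into the camera frame by a pose hypothesis and reprojected through the camera intrinsics; the loss penalizes the gap between the starting pixels and the reprojected ones. The function names, the pinhole camera model, and the pose parameterization `(R, t)` are illustrative assumptions, not the authors' API.

```python
import numpy as np

def project(points_3d, K):
    """Pinhole projection of Nx3 camera-frame points to Nx2 pixel coords."""
    uv = (K @ points_3d.T).T          # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]     # perspective divide

def cycle_consistency_loss(pixels, canon_pts, R, t, K):
    """2D -> 3D -> 2D cycle (illustrative sketch).

    pixels:    Nx2 image coordinates where the network sampled.
    canon_pts: Nx3 canonical-space points predicted for those pixels.
    R, t:      rotation (3x3) and translation (3,) mapping canonical
               space into the camera frame (a pose hypothesis).
    K:         3x3 camera intrinsics.
    """
    cam_pts = canon_pts @ R.T + t     # canonical frame -> camera frame
    reproj = project(cam_pts, K)      # camera frame -> image plane
    # Mean reprojection error: zero iff the cycle closes exactly.
    return np.mean(np.linalg.norm(pixels - reproj, axis=1))
```

A consistent prediction closes the cycle: if the canonical points, pose, and intrinsics agree, the reprojected pixels land back where the cycle started, and the loss vanishes. Analogous cycles through a second instance or a later frame follow the same pattern with the correspondence map composed across images.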