Object pose estimation is an important component of most vision pipelines for embodied agents, and of 3D vision more generally. In this paper we tackle the problem of estimating the pose of novel object categories in a zero-shot manner. This extends much of the existing literature by removing the need for pose-labelled datasets or category-specific CAD models for training or inference. Specifically, we make the following contributions. First, we formalise the zero-shot, category-level pose estimation problem and frame it in a way that is most applicable to real-world embodied agents. Second, we propose a novel method based on semantic correspondences from a self-supervised vision transformer to solve the pose estimation problem. We further re-purpose the recent CO3D dataset to present a controlled and realistic test setting. Finally, we demonstrate that all baselines for our proposed task perform poorly, and show that our method provides a six-fold improvement in average rotation accuracy at 30 degrees. Our code is available at https://github.com/applied-ai-lab/zero-shot-pose.
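As a concrete illustration of the semantic-correspondence idea mentioned above, the sketch below extracts dense patch descriptors from the publicly released DINO ViT-S/8 checkpoint (loaded via torch.hub) and matches them between two views with a simple mutual nearest-neighbour rule. This is a minimal sketch under stated assumptions, not the paper's actual pipeline: the image paths `view_a.png` / `view_b.png` are hypothetical placeholders, and the matching rule here is a deliberately simplified stand-in for whatever correspondence criterion the repository implements.

```python
# Minimal sketch: semantic correspondences between two views using a
# self-supervised DINO ViT (assumption: DINO ViT-S/8 from torch.hub).
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
# Official DINO hub entry point (facebookresearch/dino).
model = torch.hub.load("facebookresearch/dino:main", "dino_vits8").to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

def patch_features(path):
    """Return L2-normalised per-patch descriptors, shape (num_patches, dim)."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        tokens = model.get_intermediate_layers(img, n=1)[0]  # (1, 1+N, D)
    return F.normalize(tokens[0, 1:], dim=-1)                # drop the CLS token

# Hypothetical input images: two views of the same object instance/category.
feats_a = patch_features("view_a.png")
feats_b = patch_features("view_b.png")

sim = feats_a @ feats_b.T          # cosine similarity between all patch pairs
nn_ab = sim.argmax(dim=1)          # best match in B for each patch in A
nn_ba = sim.argmax(dim=0)          # best match in A for each patch in B
mutual = nn_ba[nn_ab] == torch.arange(len(nn_ab), device=device)
matches = [(i, int(nn_ab[i])) for i in torch.nonzero(mutual).flatten().tolist()]
# `matches` holds mutual nearest-neighbour patch index pairs between the views.
```

Back-projecting the matched patch indices to pixel (or, with depth, 3D) coordinates yields sparse semantic correspondences, from which a relative pose could in principle be recovered with a robust solver; the details of that step in the actual method are in the linked repository.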