Object manipulation requires accurate object pose estimation. In open environments, robots encounter unknown objects and therefore need semantic understanding to generalize both within known categories and beyond them. To address this challenge, we present SCOPE, a diffusion-based category-level object pose estimation model that eliminates the need for discrete category labels by leveraging DINOv2 features as continuous semantic priors. By combining these DINOv2 features with photorealistic training data and a noise model for point normals, we reduce the Sim2Real gap in category-level object pose estimation. Furthermore, injecting the continuous semantic priors via cross-attention enables SCOPE to learn canonicalized object coordinate systems across object instances beyond the distribution of known categories. SCOPE outperforms the current state of the art in synthetically trained category-level object pose estimation, achieving a relative improvement of 31.9\% on the 5$^\circ$5cm metric. Additional experiments on two instance-level datasets demonstrate generalization beyond known object categories, enabling grasping of unseen objects from unknown categories with a success rate of up to 100\%. Code is available at: https://github.com/hoenigpeter/scope.