Generalizable 3D part segmentation is important but challenging in vision and robotics. Training deep models via conventional supervised methods requires large-scale 3D datasets with fine-grained part annotations, which are costly to collect. This paper explores an alternative approach to low-shot part segmentation of 3D point clouds by leveraging a pretrained image-language model, GLIP, which achieves superior performance on open-vocabulary 2D detection. We transfer the rich knowledge from 2D to 3D through GLIP-based part detection on point cloud renderings and a novel 2D-to-3D label lifting algorithm. We also utilize multi-view 3D priors and few-shot prompt tuning to boost performance significantly. Extensive evaluation on the PartNet and PartNet-Mobility datasets shows that our method enables excellent zero-shot 3D part segmentation. Our few-shot version not only outperforms existing few-shot approaches by a large margin but also achieves highly competitive results compared to the fully supervised counterpart. Furthermore, we demonstrate that our method can be directly applied to iPhone-scanned point clouds without significant domain gaps.