Recent advances in 3D semantic segmentation with deep neural networks have shown remarkable success, with rapid performance increases on available datasets. However, current 3D semantic segmentation benchmarks contain only a small number of categories -- fewer than 30 for ScanNet and SemanticKITTI, for instance -- which is not enough to reflect the diversity of real environments (e.g., semantic image understanding covers hundreds to thousands of classes). Thus, we propose to study a larger vocabulary for 3D semantic segmentation with a new extended benchmark on ScanNet data with 200 class categories, an order of magnitude more than previously studied. This large number of class categories also induces a large natural class imbalance; both are challenging for existing 3D semantic segmentation methods. To learn more robust 3D features in this context, we propose a language-driven pre-training method that encourages learned 3D features, which may have limited training examples, to lie close to their pre-trained text embeddings. Extensive experiments show that our approach consistently outperforms state-of-the-art 3D pre-training for 3D semantic segmentation on our proposed benchmark (+9% relative mIoU), including limited-data scenarios with +25% relative mIoU using only 5% of the annotations.
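To make the pre-training idea concrete, the following is a minimal sketch of a language-grounded alignment loss: per-point 3D features are pulled toward the frozen text embedding of their ground-truth class and pushed away from the other classes' embeddings. All names (`point_feats`, `labels`, `text_embeds`) are illustrative assumptions, not the authors' actual API.

```python
import torch
import torch.nn.functional as F

def language_grounded_loss(point_feats: torch.Tensor,
                           labels: torch.Tensor,
                           text_embeds: torch.Tensor,
                           temperature: float = 0.07) -> torch.Tensor:
    """Sketch of aligning 3D features with pre-trained text embeddings.

    point_feats: (N, D) per-point features from the 3D segmentation backbone.
    labels:      (N,)   ground-truth class indices in [0, C).
    text_embeds: (C, D) frozen pre-trained text embeddings, one per class
                 (assumed pre-computed, e.g., from a language model).
    """
    # Cosine similarity between every point feature and every class embedding.
    point_feats = F.normalize(point_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = point_feats @ text_embeds.t() / temperature  # (N, C)
    # Cross-entropy pulls each 3D feature toward its own class's text anchor
    # and away from the anchors of all other classes, so rare classes are
    # still grounded by the structure of the text embedding space.
    return F.cross_entropy(logits, labels)
```

Because the text anchors are fixed and shared across all scenes, even categories with few annotated examples receive a well-defined target in feature space, which is one plausible reading of how the method counters the class imbalance described above.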