3D semantic occupancy prediction offers intuitive and efficient scene understanding and has attracted significant interest in autonomous driving perception. Existing approaches either rely on full supervision, which demands costly voxel-level annotations, or on self-supervision, which provides limited guidance and yields suboptimal performance. To address these challenges, we propose OccLE, a Label-Efficient 3D Semantic Occupancy Prediction framework that takes images and LiDAR as inputs and maintains high performance with limited voxel annotations. Our intuition is to decouple the semantic and geometric learning tasks and then fuse the learned feature grids from both tasks for the final semantic occupancy prediction. Specifically, the semantic branch distills a 2D foundation model to provide aligned pseudo labels for 2D and 3D semantic learning. The geometric branch integrates image and LiDAR inputs through cross-plane synergy based on their inherent properties and employs semi-supervision to enhance geometry learning. We fuse the semantic and geometric feature grids through a Dual Mamba module and incorporate a scatter-accumulated projection to supervise unannotated predictions with aligned pseudo labels. Experiments show that OccLE achieves competitive performance with only 10\% of voxel annotations on the SemanticKITTI and Occ3D-nuScenes datasets. The code will be publicly released at https://github.com/NerdFNY/OccLE
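The scatter-accumulated projection can be understood as accumulating voxel-wise semantic logits into the image plane so that unannotated voxels can be supervised with aligned 2D pseudo labels. The following is a minimal sketch under assumed tensor names and shapes (voxel_logits, pixel_uv, image_hw); it is an illustrative reading of the idea, not the authors' implementation.

```python
import torch

def scatter_accumulated_projection(voxel_logits, pixel_uv, image_hw):
    """Accumulate voxel logits onto the image plane (hypothetical sketch).

    voxel_logits: (N, C) semantic logits of N visible voxels
    pixel_uv:     (N, 2) integer (u, v) image coordinates of each voxel
    image_hw:     (H, W) output image size
    Returns per-pixel averaged logits of shape (H*W, C).
    """
    H, W = image_hw
    C = voxel_logits.shape[1]
    # Flatten 2D pixel coordinates into 1D scatter indices.
    idx = pixel_uv[:, 1].long() * W + pixel_uv[:, 0].long()   # (N,)
    accum = torch.zeros(H * W, C, device=voxel_logits.device)
    count = torch.zeros(H * W, 1, device=voxel_logits.device)
    # Scatter-add voxel logits into their pixel bins, then normalize by count.
    accum.index_add_(0, idx, voxel_logits)
    count.index_add_(0, idx, torch.ones_like(voxel_logits[:, :1]))
    return accum / count.clamp(min=1)

# Usage (illustrative): compare the accumulated per-pixel logits against
# aligned 2D pseudo labels (e.g., via cross-entropy) to provide supervision
# for voxels that lack 3D annotations.
```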