An effective framework for learning 3D representations for perception tasks is distilling rich self-supervised image features via contrastive learning. However, image-to-point representation learning for autonomous driving datasets faces two main challenges: 1) the abundance of self-similarity, which causes the contrastive losses to push away semantically similar point and image regions, disturbing the local semantic structure of the learned representations, and 2) severe class imbalance, as pretraining is dominated by over-represented classes. We propose to alleviate the self-similarity problem through a novel semantically tolerant image-to-point contrastive loss that takes into consideration the semantic distance between positive and negative image regions, minimizing the contrasting of semantically similar point and image regions. Additionally, we address class imbalance by designing a class-agnostic balanced loss that approximates the degree of class imbalance through an aggregate sample-to-samples semantic similarity measure. We demonstrate that our semantically tolerant contrastive loss with class balancing improves state-of-the-art 2D-to-3D representation learning in all evaluation settings on 3D semantic segmentation. Our method consistently outperforms state-of-the-art 2D-to-3D representation learning frameworks across a wide range of 2D self-supervised pretrained models.
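To make the two ideas concrete, below is a minimal PyTorch-style sketch of a semantically tolerant image-to-point contrastive objective with class-agnostic balancing, under stated assumptions. The function name `st_contrastive_loss`, the tolerance knob `lam`, and the inverse-aggregate-similarity balancing weights are illustrative choices, not the paper's exact formulation or notation.

```python
import torch
import torch.nn.functional as F

def st_contrastive_loss(point_feats, img_feats, temperature=0.07, lam=0.5):
    """
    Sketch of a semantically tolerant image-to-point contrastive loss.

    point_feats: (N, D) features of N point regions from the 3D backbone.
    img_feats:   (N, D) features of the matched image regions from a frozen
                 self-supervised 2D encoder; row i of each tensor forms
                 the positive pair.
    """
    point_feats = F.normalize(point_feats, dim=1)
    img_feats = F.normalize(img_feats, dim=1)

    # Standard InfoNCE logits: positives sit on the diagonal.
    logits = point_feats @ img_feats.t() / temperature  # (N, N)

    # Semantic similarity between image regions, measured in the frozen
    # 2D feature space (detached: it only reweights the loss).
    with torch.no_grad():
        sem_sim = (img_feats @ img_feats.t()).clamp(min=0)  # (N, N)

    # Tolerance weights: negatives semantically close to the positive
    # contribute less to the denominator, so the loss avoids pushing
    # apart semantically similar point and image regions. `lam` is a
    # hypothetical knob controlling the down-weighting strength.
    neg_weight = 1.0 - lam * sem_sim
    neg_weight.fill_diagonal_(1.0)  # the positive keeps full weight

    # Class-agnostic balancing: a sample similar to many other samples
    # likely belongs to an over-represented class, so its loss term is
    # down-weighted by its aggregate sample-to-samples similarity.
    with torch.no_grad():
        agg_sim = sem_sim.mean(dim=1)                      # (N,)
        balance = 1.0 / (agg_sim + 1e-6)
        balance = balance / balance.sum() * len(balance)   # mean-1 weights

    # Weighted softmax cross-entropy over each row.
    exp_logits = torch.exp(logits) * neg_weight
    log_prob = logits.diag() - torch.log(exp_logits.sum(dim=1))
    return -(balance * log_prob).mean()
```

A key design point this sketch reflects: both the tolerance weights and the balancing weights are computed from the pretrained 2D features under `no_grad`, so they reshape the loss landscape without adding learnable parameters or requiring any labels.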