We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds. The core idea is to train the model on a pretext task, the reconstruction of the surface from which the 3D points were sampled, and to use the underlying latent vectors as input to the perception head. The intuition is that if the network is able to reconstruct the scene surface given only sparse input points, then it probably also captures some fragments of semantic information that can be used to boost an actual perception task. This principle has a very simple formulation, which makes it both easy to implement and widely applicable to a large range of 3D sensors and deep networks performing semantic segmentation or object detection. In fact, it supports a single-stream pipeline, as opposed to most contrastive learning approaches, allowing training on limited resources. We conducted extensive experiments on various autonomous driving datasets, involving very different kinds of lidars, for both semantic segmentation and object detection. The results show the effectiveness of our method in learning useful representations without any annotation, compared to existing approaches. Code is available at https://github.com/valeoai/ALSO.
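To make the pretext task concrete, below is a minimal PyTorch sketch of the single-stream pipeline described above: a backbone encodes the raw points into latent vectors, and a small occupancy head, conditioned on those latents, classifies query points sampled along the sensor rays as empty or occupied. All names here (`PointBackbone`, `OccupancyHead`, `make_queries`) and the toy architectures are hypothetical illustrations under simplified assumptions, not the actual ALSO implementation; see the linked repository for the real code.

```python
# Hypothetical sketch of self-supervised pre-training by surface reconstruction.
# Module names and the ray-based labeling scheme are simplified stand-ins.
import torch
import torch.nn as nn

class PointBackbone(nn.Module):
    """Toy per-point encoder standing in for a real 3D backbone."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )

    def forward(self, points):            # points: (N, 3)
        return self.mlp(points)           # latents: (N, latent_dim)

class OccupancyHead(nn.Module):
    """Predicts whether a query point is occupied, conditioned on the latent
    of its anchor input point and the query's offset from that anchor."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(latent_dim + 3, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, latents, offsets):  # offsets: query - anchor, (N, 3)
        return self.mlp(torch.cat([latents, offsets], dim=-1)).squeeze(-1)

def make_queries(points, sigma=0.1):
    """Free self-supervision from sensor geometry (simplified): points jittered
    toward the sensor along the lidar ray are labeled empty (0), points at or
    slightly behind the measured surface are labeled occupied (1)."""
    dirs = points / points.norm(dim=-1, keepdim=True).clamp(min=1e-6)
    t = torch.rand(points.shape[0], 1)               # random step in [0, 1)
    empty = points - dirs * (sigma + t * sigma)      # in front of the surface
    full = points + dirs * (t * sigma)               # at/behind the surface
    queries = torch.cat([empty, full], dim=0)
    labels = torch.cat([torch.zeros(len(empty)), torch.ones(len(full))])
    anchors = torch.cat([points, points], dim=0)     # anchor input point
    return queries, anchors, labels

backbone, head = PointBackbone(), OccupancyHead()
optim = torch.optim.Adam(
    list(backbone.parameters()) + list(head.parameters()), lr=1e-3)

points = torch.randn(1024, 3)                        # one unlabeled lidar scan
queries, anchors, labels = make_queries(points)
latents = backbone(anchors)                          # single stream: one forward pass
logits = head(latents, queries - anchors)
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss.backward()
optim.step()
# After pre-training, `backbone` initializes the perception model; the
# occupancy head is discarded in favor of a segmentation/detection head.
```

The key point the sketch illustrates is the single-stream property: one forward pass per scan produces the latents that drive the reconstruction loss, with no second augmented view or momentum encoder as in most contrastive pipelines, which is what keeps the resource cost low.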