We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds. The core idea is to train the model on a pretext task, namely the reconstruction of the surface on which the 3D points are sampled, and to use the underlying latent vectors as input to the perception head. The intuition is that if the network is able to reconstruct the scene surface given only sparse input points, then it probably also captures some fragments of semantic information, which can be used to boost an actual perception task. This principle has a very simple formulation, which makes it both easy to implement and applicable to a wide range of 3D sensors and deep networks performing semantic segmentation or object detection. In fact, it supports a single-stream pipeline, as opposed to most contrastive learning approaches, which allows training on limited resources. We conducted extensive experiments on various autonomous driving datasets, involving very different kinds of lidars, for both semantic segmentation and object detection. The results show that our method is effective at learning useful representations without any annotation, compared to existing approaches. Code is available at https://github.com/valeoai/ALSO
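To make the pretext task more concrete, the following is a minimal NumPy sketch of one way a surface-reconstruction objective could be set up as occupancy prediction. It is not the released implementation. It assumes, beyond what the abstract states, that supervision comes from sensor visibility: a query slightly in front of a measured point along its ray is labeled empty, and a query slightly behind it is labeled occupied. The names `make_occupancy_queries`, `toy_decoder`, and `bce_from_logits`, the offset `delta`, and the random `latents` standing in for backbone features are all hypothetical.

```python
import numpy as np


def make_occupancy_queries(points, sensor_origin, delta=0.1):
    """Build query points and occupancy labels from sensor visibility.

    Assumption (not stated in the abstract): space just in front of a
    measured point along its ray is empty (label 0), and space just
    behind it is occupied (label 1).
    """
    rays = points - sensor_origin
    dirs = rays / np.linalg.norm(rays, axis=1, keepdims=True)
    front = points - delta * dirs   # between the sensor and the surface
    behind = points + delta * dirs  # just past the surface
    queries = np.concatenate([front, behind], axis=0)
    labels = np.concatenate([np.zeros(len(points)), np.ones(len(points))])
    return queries, labels


def bce_from_logits(logits, labels):
    """Numerically stable binary cross-entropy on raw logits."""
    return np.mean(
        np.maximum(logits, 0.0)
        - logits * labels
        + np.log1p(np.exp(-np.abs(logits)))
    )


def toy_decoder(latents, offsets, weights):
    """Hypothetical linear occupancy decoder on [latent, query offset]."""
    features = np.concatenate([latents, offsets], axis=1)
    return features @ weights


rng = np.random.default_rng(0)
points = rng.uniform(1.0, 10.0, size=(128, 3))  # synthetic lidar returns
origin = np.zeros(3)                            # sensor at the origin
queries, labels = make_occupancy_queries(points, origin, delta=0.1)

# Stand-in for the backbone output: one latent vector per input point.
latents = rng.normal(size=(128, 16))

# Each query is decoded from the latent of the point that generated it.
query_latents = np.concatenate([latents, latents], axis=0)
offsets = queries - np.concatenate([points, points], axis=0)
weights = rng.normal(scale=0.01, size=(16 + 3,))

logits = toy_decoder(query_latents, offsets, weights)
loss = bce_from_logits(logits, labels)  # pretext loss to minimize
```

In a real pipeline the random `latents` would be replaced by the output of the point-cloud backbone, and minimizing `loss` would update the backbone and decoder jointly. After pre-training, the decoder would be discarded and the backbone latents fed to the segmentation or detection head, which is the single-stream setup the abstract describes.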