State-of-the-art 3D detection methods rely on supervised learning and large labelled datasets. However, annotating lidar data is resource-intensive, and relying solely on supervised learning limits the applicability of trained models. Against this backdrop, we propose a self-supervised training strategy to learn a general point cloud backbone model for downstream 3D vision tasks. 3D scene flow can be estimated with self-supervised learning using cycle consistency, which removes the need for labelled data. Moreover, perceiving objects in traffic scenarios heavily relies on making sense of sparse data in the spatio-temporal context. Our main contribution leverages learned flow and motion representations and combines a self-supervised backbone with a 3D detection head, focusing mainly on the relation between the scene flow and detection tasks. In this way, self-supervised scene flow training constructs point motion features in the backbone, which help a 3D detection head distinguish objects by their distinct motion patterns. Experiments on the KITTI and nuScenes benchmarks show that the proposed self-supervised pre-training significantly improves 3D detection performance.
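To make the cycle-consistency idea concrete, below is a minimal PyTorch-style sketch of such a self-supervised scene flow loss. The names (`flow_net`, `pc_t`, `pc_t1`) are illustrative assumptions, not the paper's actual implementation; in practice this term is typically combined with a nearest-neighbour (Chamfer-style) anchor loss, since the cycle term alone admits the trivial zero-flow solution.

```python
import torch

def cycle_consistency_loss(flow_net, pc_t, pc_t1):
    """Self-supervised cycle-consistency loss for scene flow (sketch).

    flow_net: a model predicting per-point flow from a source to a target cloud
              (assumed signature: flow_net(source, target) -> (N, 3) flow).
    pc_t, pc_t1: (N, 3) and (M, 3) point clouds at times t and t+1.
    """
    # Forward pass: predict flow from frame t to frame t+1 and warp the points.
    flow_fwd = flow_net(pc_t, pc_t1)          # (N, 3)
    pc_t_warped = pc_t + flow_fwd

    # Backward pass: predict flow from the warped cloud back toward frame t.
    flow_bwd = flow_net(pc_t_warped, pc_t)    # (N, 3)
    pc_t_cycled = pc_t_warped + flow_bwd

    # Cycle consistency: the round trip should return each point to its origin,
    # so the residual can be penalised without any flow annotations.
    return torch.mean(torch.sum((pc_t_cycled - pc_t) ** 2, dim=-1))
```

Because no ground-truth flow appears anywhere in this objective, the backbone inside `flow_net` can be pre-trained on raw lidar sequences and later reused under a 3D detection head.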