RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving

Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, e.g., via range projection, is an effective and popular approach. These projection-based methods usually benefit from fast computations and, when combined with techniques that use other point cloud representations, achieve state-of-the-art results. Today, projection-based methods leverage 2D CNNs, but recent advances in computer vision show that vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks. In this work, we ask whether projection-based methods for 3D semantic segmentation can benefit from these latest improvements on ViTs. We answer positively, but only after combining them with three key ingredients: (a) ViTs are notoriously hard to train and require a lot of training data to learn powerful representations. By preserving the same backbone architecture as for RGB images, we can exploit the knowledge from long training on large image collections that are much cheaper to acquire and annotate than point clouds. We reach our best results with ViTs pre-trained on large image datasets. (b) We compensate for ViTs' lack of inductive bias by substituting a tailored convolutional stem for the classical linear embedding layer. (c) We refine pixel-wise predictions with a convolutional decoder and a skip connection from the convolutional stem to combine low-level but fine-grained features of the convolutional stem with the high-level but coarse predictions of the ViT encoder. With these ingredients, we show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI. We provide the implementation code at https://github.com/valeoai/rangevit.
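To make the range projection mentioned above concrete, here is a minimal sketch of the spherical projection commonly used by projection-based LiDAR methods. It is not taken from the RangeViT code base: the function name `range_projection`, the image size, and the vertical field-of-view values are illustrative assumptions, and a real pipeline would vectorize this and keep additional channels such as intensity.

```python
import math


def range_projection(points, height=32, width=1024,
                     fov_up_deg=10.0, fov_down_deg=-30.0):
    """Project 3D points (x, y, z) onto a height x width range image.

    All default values are hypothetical and depend on the sensor.
    Returns (image, index): image[row][col] is the range of the closest
    point that falls in that pixel (-1.0 if empty), and index[row][col]
    is the position of that point in `points` (-1 if empty).
    """
    fov_down = math.radians(fov_down_deg)
    fov = math.radians(fov_up_deg) - fov_down  # total vertical field of view

    image = [[-1.0] * width for _ in range(height)]
    index = [[-1] * width for _ in range(height)]

    for i, (x, y, z) in enumerate(points):
        r = math.sqrt(x * x + y * y + z * z)
        if r == 0.0:
            continue  # a point at the sensor origin has no direction

        yaw = math.atan2(y, x)      # horizontal angle in [-pi, pi]
        pitch = math.asin(z / r)    # vertical angle

        # Map the angles to continuous pixel coordinates.
        u = 0.5 * (1.0 - yaw / math.pi) * width
        v = (1.0 - (pitch - fov_down) / fov) * height

        # Clamp to valid pixel indices.
        col = min(width - 1, max(0, int(math.floor(u))))
        row = min(height - 1, max(0, int(math.floor(v))))

        # When several points share a pixel, keep the closest one.
        if image[row][col] < 0.0 or r < image[row][col]:
            image[row][col] = r
            index[row][col] = i

    return image, index
```

The `index` map records which point produced each pixel, which is what allows per-pixel predictions of a 2D network to be carried back to the 3D points. Points hidden behind a closer point in the same pixel receive no pixel of their own, so projection-based methods typically need an extra step to label them.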
