Reducing the number of annotations required for supervised training is vital when labels are scarce and costly. This is particularly important for semantic segmentation on 3D datasets, which are often significantly smaller and more difficult to annotate than their image-based counterparts. Self-supervised pre-training on unlabelled data is one way to reduce the number of manual annotations needed. Previous work has focused on pre-training with point clouds exclusively; while useful, this approach often requires two or more registered views. In the present work, we combine the image and point cloud modalities by first learning self-supervised image features and then using these features to train a 3D model. Because our pre-training method incorporates image data, which is included in many 3D datasets, it requires only a single scan of a scene and can be applied where localization information is unavailable. We demonstrate that, despite using single scans, our pre-training approach achieves performance comparable to other multi-scan, point-cloud-only methods.
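To make the 2D-to-3D transfer idea concrete, below is a minimal sketch of one plausible instantiation: a frozen feature map from a self-supervised image encoder supervises a point cloud encoder, by projecting each 3D point into the image and pulling its 3D feature toward the 2D feature at that pixel. All names here (`PointEncoder`, `distill_step`, the random stand-in data) are illustrative assumptions, not the authors' actual implementation.

```python
# A hypothetical sketch of distilling self-supervised image features into a
# 3D model from a single scan. Assumes points are already in the camera frame
# and that a frozen image encoder has produced a dense feature map.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointEncoder(nn.Module):
    """Toy per-point MLP standing in for a real 3D backbone."""
    def __init__(self, dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, dim),
        )

    def forward(self, xyz):  # (N, 3) -> (N, dim)
        return self.mlp(xyz)

def project_to_pixels(xyz, K):
    """Pinhole projection of camera-frame points (N, 3) with intrinsics K (3, 3)."""
    uvw = xyz @ K.T
    return uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)  # (N, 2) pixel coordinates

def distill_step(model, opt, xyz, image_feats, K):
    """One pre-training step: align each point's 3D feature with the frozen 2D
    feature at the pixel it projects to (single scan, no registration needed)."""
    C, H, W = image_feats.shape
    uv = project_to_pixels(xyz, K)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,
                        uv[:, 1] / (H - 1) * 2 - 1], dim=-1)
    target = F.grid_sample(image_feats[None], grid[None, None],
                           align_corners=True)  # (1, C, 1, N)
    target = target[0, :, 0].T                  # (N, C)
    pred = model(xyz)                           # (N, C)
    loss = 1 - F.cosine_similarity(pred, target.detach(), dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy usage with random stand-ins for a real scan and image feature map.
torch.manual_seed(0)
model = PointEncoder(dim=64)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
xyz = torch.randn(1024, 3) + torch.tensor([0.0, 0.0, 3.0])  # points in front of camera
image_feats = torch.randn(64, 120, 160)                      # frozen 2D features (C, H, W)
K = torch.tensor([[100.0, 0.0, 80.0], [0.0, 100.0, 60.0], [0.0, 0.0, 1.0]])
print(distill_step(model, opt, xyz, image_feats, K))
```

Note the single-scan property claimed in the abstract: the loss only requires one point cloud, one image, and the camera intrinsics relating them, with no second registered view or global localization.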