Reducing the quantity of annotations required for supervised training is vital when labels are scarce and costly. This reduction is especially important for semantic segmentation tasks involving 3D datasets, which are often significantly smaller and more challenging to annotate than their image-based counterparts. Self-supervised pre-training on large unlabelled datasets is one way to reduce the amount of manual annotation needed. Previous work has focused on pre-training with point cloud data exclusively; this approach often requires two or more registered views. In the present work, we combine image and point cloud modalities by first learning self-supervised image features and then using these features to train a 3D model. By incorporating image data, which many 3D datasets already include, our pre-training method requires only a single scan of a scene. We demonstrate that our pre-training approach, despite using single scans, achieves performance comparable to other multi-scan, point-cloud-only methods.
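To make the two-stage idea concrete, the sketch below illustrates one plausible form of the pipeline: per-point features from a 3D network are trained to match frozen, self-supervised 2D features at the pixels each point projects to. This is a minimal illustration, not the authors' implementation; `net3d`, `feats2d`, and the loss choice are all assumptions, and the camera intrinsics `K` and extrinsics `T` are taken as given.

```python
# Minimal sketch (hypothetical, not the paper's code) of distilling frozen
# self-supervised 2D image features into a 3D point network, using a single
# scan paired with one RGB image and known camera calibration.
import torch
import torch.nn.functional as F

def project_points(points, K, T, image_hw):
    """Project N x 3 world-frame points to pixel coordinates.
    K: 3x3 intrinsics; T: 4x4 world-to-camera extrinsics (assumed known)."""
    n = points.shape[0]
    homo = torch.cat([points, torch.ones(n, 1)], dim=1)          # N x 4
    cam = (T @ homo.T).T[:, :3]                                  # camera frame
    pix = (K @ cam.T).T                                          # N x 3
    uv = pix[:, :2] / pix[:, 2:3].clamp(min=1e-6)                # N x 2 pixels
    h, w = image_hw
    valid = (cam[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
            & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv, valid

def distill_step(net3d, feats2d, points, K, T, opt):
    """One pre-training step: align per-point 3D features with the frozen
    2D feature map (feats2d: 1 x C x H x W) at projected pixel locations."""
    h, w = feats2d.shape[-2:]
    uv, valid = project_points(points, K, T, (h, w))
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([uv[:, 0] / (w - 1), uv[:, 1] / (h - 1)], dim=1) * 2 - 1
    target = F.grid_sample(feats2d, grid.view(1, 1, -1, 2),
                           align_corners=True)                   # 1 x C x 1 x N
    target = target.squeeze(0).squeeze(1).T                      # N x C
    pred = net3d(points)                                         # N x C predictions
    # Cosine distance, computed only on points with a valid projection.
    loss = (1 - F.cosine_similarity(pred[valid], target[valid], dim=1)).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Because only points visible in the image receive a target, a single scan with its paired image suffices for this style of supervision, which is what removes the need for multiple registered views.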