With the recent growth of urban mapping and autonomous driving efforts, there has been an explosion of raw 3D data collected from terrestrial platforms with lidar scanners and color cameras. However, due to high labeling costs, ground-truth 3D semantic segmentation annotations are limited in both quantity and geographic diversity, while also being difficult to transfer across sensors. In contrast, large image collections with ground-truth semantic segmentations are readily available for diverse sets of scenes. In this paper, we investigate how to use only those labeled 2D image collections to supervise the training of 3D semantic segmentation models. Our approach is to train a 3D model from pseudo-labels derived from 2D semantic image segmentations using multiview fusion. We address several novel issues with this approach, including how to select trusted pseudo-labels, how to sample 3D scenes with rare object categories, and how to decouple input features from 2D images from pseudo-labels during training. The proposed network architecture, 2D3DNet, achieves significantly better performance (+6.2-11.4 mIoU) than baselines in experiments on a new urban dataset with lidar and images captured in 20 cities across 5 continents.
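The multiview fusion step described above, where per-point pseudo-labels are derived from 2D segmentations of the views a 3D point projects into, could be sketched as follows. This is a minimal illustration only, not the 2D3DNet implementation: the function name and the view-count and agreement thresholds are hypothetical stand-ins for the paper's trusted-pseudo-label selection.

```python
from collections import Counter

def fuse_pseudo_labels(point_votes, min_views=2, min_agreement=0.8):
    """Fuse per-view 2D semantic labels into per-point 3D pseudo-labels.

    point_votes: entry i holds the 2D labels observed for 3D point i
    across the camera views it projects into (projection itself is
    assumed to have been done already).

    A point is given a pseudo-label only if enough views observe it and
    a large enough fraction of those views agree; otherwise it is left
    unlabeled (None) and excluded from training. Thresholds here are
    illustrative assumptions, not values from the paper.
    """
    labels = []
    for votes in point_votes:
        if len(votes) < min_views:
            labels.append(None)  # too few observations to trust
            continue
        label, count = Counter(votes).most_common(1)[0]
        # keep the majority label only when the views mostly agree
        labels.append(label if count / len(votes) >= min_agreement else None)
    return labels
```

In this sketch, disagreement across views (e.g. one view says "car", another "road") yields no pseudo-label rather than a noisy one, reflecting the idea of selecting only trusted pseudo-labels for supervision.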