We propose a novel framework that learns 3D point cloud semantics from 2D multi-view image observations containing pose error. On the one hand, learning directly from massive, unstructured and unordered 3D point clouds is computationally and algorithmically more difficult than learning from compactly organized, context-rich 2D RGB images. On the other hand, standard automated-driving datasets capture both LiDAR point clouds and RGB images. This motivates a "task transfer" paradigm in which 3D semantic segmentation benefits from aggregating 2D semantic cues, even though the 2D image observations contain pose noise. Pose noise and erroneous predictions from 2D semantic segmentation approaches are the main challenges for this task transfer. To alleviate the influence of these factors, we perceive each 3D point through multi-view images and associate a patch observation with each image. Moreover, the semantic labels of a block of neighboring 3D points are predicted simultaneously, which lets us exploit the point-structure prior and further improve performance. A hierarchical full attention network (HiFANet) is designed to sequentially aggregate patch, bag-of-frames and inter-point semantic cues, with a hierarchical attention mechanism tailored to the different levels of semantic cues. In addition, each attention block substantially reduces the feature size before feeding the next attention block, which keeps our framework slim. Experimental results on Semantic-KITTI show that the proposed framework significantly outperforms existing 3D point cloud based methods, requires much less training data, and tolerates pose noise. The code is available at https://github.com/yuhanghe01/HiFANet.
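The three aggregation levels described above can be illustrated with a minimal sketch. It is not the HiFANet implementation: the learned attention layers are replaced by plain dot-product attention with fixed, hypothetical query vectors (`q_patch`, `q_frame`), the inputs are synthetic per-pixel class probabilities, and all function names and sizes (4 points, 3 views, 3x3 patches, 5 classes) are assumptions chosen for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Collapse equal-length vectors into one, weighted by softmax(v . query)."""
    scores = [sum(a * b for a, b in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def hierarchical_pool(block, q_patch, q_frame):
    """block[point][view][pixel] is a class-probability vector.

    Level 1 pools the pixels of each patch, level 2 pools the views
    (bag of frames) of each point. Each level shrinks the feature size.
    """
    per_point = []
    for views in block:
        frame_feats = [attention_pool(patch, q_patch) for patch in views]
        per_point.append(attention_pool(frame_feats, q_frame))
    return per_point


def inter_point_attention(point_feats):
    """Level 3: scaled dot-product self-attention across the points of a block
    (identity projections, an assumption made to keep the sketch short)."""
    scale = math.sqrt(len(point_feats[0]))
    return [attention_pool(point_feats, [x / scale for x in p]) for p in point_feats]


def noisy_one_hot(label, n_classes, rng, flip_prob=0.2):
    """Synthetic 2D prediction: the true label, flipped at random with flip_prob."""
    if rng.random() < flip_prob:
        label = rng.randrange(n_classes)
    return [1.0 if c == label else 0.0 for c in range(n_classes)]


rng = random.Random(0)
N_POINTS, N_VIEWS, N_PIXELS, N_CLASSES = 4, 3, 9, 5  # hypothetical sizes
TRUE_LABEL = 2  # all points in the block share one label in this toy example

block = [
    [[noisy_one_hot(TRUE_LABEL, N_CLASSES, rng) for _ in range(N_PIXELS)]
     for _ in range(N_VIEWS)]
    for _ in range(N_POINTS)
]
q_patch = [1.0] * N_CLASSES  # hypothetical stand-ins for learned queries
q_frame = [1.0] * N_CLASSES

refined = inter_point_attention(hierarchical_pool(block, q_patch, q_frame))
for scores in refined:
    print([round(s, 3) for s in scores])
```

Because every level forms a convex combination of probability vectors, each output row remains a valid distribution over the classes. With uniform queries the first two levels reduce to simple averaging; a trained network would instead learn queries and projections that down-weight views corrupted by pose noise.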