Learning 3D Semantics from Pose-Noisy 2D Images with Hierarchical Full Attention Network

We propose a novel framework to learn 3D point cloud semantics from 2D multi-view image observations that contain pose error. On the one hand, learning directly from massive, unstructured, and unordered 3D point clouds is computationally and algorithmically more difficult than learning from compactly organized, context-rich 2D RGB images. On the other hand, standard automated-driving datasets capture both LiDAR point clouds and RGB images. This motivates a "task transfer" paradigm in which 3D semantic segmentation benefits from aggregating 2D semantic cues, even though the 2D image observations contain pose noise. Pose noise and erroneous predictions from 2D semantic segmentation approaches are the main challenges for this task transfer. To alleviate the influence of these factors, we perceive each 3D point using multi-view images and associate a patch observation with each image. Moreover, the semantic labels of a block of neighboring 3D points are predicted simultaneously, which lets us exploit the point structure prior to further improve performance. A hierarchical full attention network (HiFANet) is designed to sequentially aggregate patch, bag-of-frames, and inter-point semantic cues, with a hierarchical attention mechanism tailored to each level of semantic cue. In addition, each attention block substantially reduces the feature size before passing it to the next attention block, which keeps the framework slim. Experimental results on Semantic-KITTI show that the proposed framework significantly outperforms existing 3D point cloud based methods, requires much less training data, and tolerates pose noise. The code is available at https://github.com/yuhanghe01/HiFANet.
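To make the three aggregation levels concrete, here is a minimal NumPy sketch of hierarchical attention pooling. It is not the authors' HiFANet implementation: the function names, the single-query dot-product pooling, the unparameterized self-attention, and the tensor layout `(points, views, patch pixels, channels)` are all assumptions chosen for illustration. It only shows how patch-level, view-level, and inter-point attention can be chained so that the feature tensor shrinks at each step.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(feats, query):
    """Pool the second-to-last axis with one query vector (hypothetical helper).

    feats: (..., N, D), query: (D,)  ->  (..., D)
    """
    scores = feats @ query / np.sqrt(feats.shape[-1])   # (..., N)
    weights = softmax(scores, axis=-1)                  # (..., N), sums to 1
    return (weights[..., None] * feats).sum(axis=-2)    # (..., D)

def self_attention(feats):
    """Scaled dot-product self-attention without learned projections.

    feats: (P, D) -> (P, D); each point attends to every point in the block.
    """
    scores = feats @ feats.T / np.sqrt(feats.shape[-1])  # (P, P)
    return softmax(scores, axis=-1) @ feats

def hierarchical_attention(patch_feats, q_patch, q_view, w_cls):
    """patch_feats: (P, V, K, D) = points, views, patch pixels, channels."""
    per_view = attention_pool(patch_feats, q_patch)   # level 1: (P, V, D)
    per_point = attention_pool(per_view, q_view)      # level 2: (P, D)
    mixed = self_attention(per_point)                 # level 3: (P, D)
    logits = mixed @ w_cls                            # (P, C) class scores
    return logits.argmax(axis=-1), logits

# Toy usage with random stand-ins for learned parameters and 2D features.
rng = np.random.default_rng(0)
P, V, K, D, C = 8, 4, 9, 16, 5
patch_feats = rng.normal(size=(P, V, K, D))
labels, logits = hierarchical_attention(
    patch_feats,
    rng.normal(size=D),       # patch-level query (assumed learned)
    rng.normal(size=D),       # view-level query (assumed learned)
    rng.normal(size=(D, C)),  # linear classifier (assumed learned)
)
print(labels.shape, logits.shape)
```

In this sketch the tensor goes from `(P, V, K, D)` to `(P, V, D)` to `(P, D)`, which mirrors the abstract's point that each attention block reduces the feature size before the next one. A real model would use learned projections and be trained end to end.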


Related content

Attention mechanism


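The attention mechanism was first proposed in the field of computer vision, but it became widely popular with the Google DeepMind paper "Recurrent Models of Visual Attention" [14], which applied attention to an RNN model for image classification. Bahdanau et al. then used an attention-like mechanism in "Neural Machine Translation by Jointly Learning to Align and Translate" [1] to perform translation and alignment jointly on a machine translation task; their work is generally regarded as the first to apply attention to NLP. Attention-based extensions of RNN models were subsequently applied to a wide range of NLP tasks. More recently, how to use attention in CNNs has also become an active research topic. The figure below shows the general trend of attention research.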