A key challenge for LiDAR-based 3D object detection is to capture sufficient features from large-scale 3D scenes, especially for distant or occluded objects. Despite recent efforts by Transformers leveraging their long-sequence modeling capability, they fail to properly balance accuracy and efficiency, suffering from either inadequate receptive fields or coarse-grained holistic correlations. In this paper, we propose an Octree-based Transformer, named OcTr, to address this issue. It first constructs a dynamic octree on the hierarchical feature pyramid by conducting self-attention at the top level and then recursively propagating to the levels below, restricted to the selected octants, which captures rich global context in a coarse-to-fine manner while keeping the computational complexity under control. Furthermore, for enhanced foreground perception, we propose a hybrid positional embedding, composed of a semantic-aware positional embedding and an attention mask, to fully exploit semantic and geometric clues. Extensive experiments are conducted on the Waymo Open Dataset and the KITTI Dataset, and OcTr achieves new state-of-the-art results.
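To make the coarse-to-fine mechanism concrete, the following is a minimal, conceptual sketch in PyTorch, not the authors' implementation: full self-attention is computed over a small set of top-level octants, the most-attended octants are kept, and attention at the finer level is restricted to their children. The two-level pyramid, the `parent_of` indexing, and the `k_keep` parameter are illustrative assumptions for this sketch.

```python
# Conceptual sketch of coarse-to-fine octree attention (assumed structure,
# not the official OcTr code). Two levels: coarse octants and fine voxels.
import torch
import torch.nn.functional as F

def self_attention(q, k, v):
    # Plain scaled dot-product attention over (N, C) token sets.
    w = F.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return w @ v, w

def octree_attention(coarse, fine, parent_of, k_keep=4):
    """
    coarse:    (Nc, C) voxel features at the top octree level
    fine:      (Nf, C) voxel features one level below
    parent_of: (Nf,)   index of each fine voxel's coarse parent octant
    k_keep:    hypothetical number of most-attended octants kept per query
    """
    # 1) Full self-attention on the top level: global context is cheap
    #    here because the number of coarse octants Nc is small.
    coarse_out, attn = self_attention(coarse, coarse, coarse)

    # 2) For each coarse query, keep only its k most-attended octants;
    #    their children define the restricted key/value set below.
    keep = attn.topk(k_keep, dim=-1).indices  # (Nc, k_keep)

    # 3) Fine-level attention restricted to children of the kept octants,
    #    so complexity stays bounded instead of growing quadratically.
    fine_out = fine.clone()
    for q_idx in range(coarse.shape[0]):
        own = parent_of == q_idx                    # fine tokens of this octant
        mask = torch.isin(parent_of, keep[q_idx])   # children of kept octants
        if own.any() and mask.any():
            out, _ = self_attention(fine[own], fine[mask], fine[mask])
            fine_out[own] = out
    return coarse_out, fine_out

# Toy usage: 8 coarse octants, 64 fine voxels (8 children each), 16 channels.
coarse = torch.randn(8, 16)
fine = torch.randn(64, 16)
parent_of = torch.arange(64) // 8
c_out, f_out = octree_attention(coarse, fine, parent_of)
print(c_out.shape, f_out.shape)  # torch.Size([8, 16]) torch.Size([64, 16])
```

In a full model this recursion would continue level by level down the feature pyramid, and the hybrid positional embedding and attention mask described above would be added to the attention computation; those details are omitted here.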