Query-based transformers have shown great potential for modeling long-range attention in many image-domain tasks, but they have rarely been considered for LiDAR-based 3D object detection because of the overwhelming size of point cloud data. In this paper, we propose CenterFormer, a center-based transformer network for 3D object detection. CenterFormer first uses a center heatmap to select center candidates on top of a standard voxel-based point cloud encoder. It then uses the feature of each center candidate as the query embedding in the transformer. To further aggregate features from multiple frames, we design an approach that fuses features through cross-attention. Lastly, regression heads are added to predict the bounding box from the output center feature representation. Our design reduces the convergence difficulty and computational complexity of the transformer structure. The results show significant improvements over strong anchor-free object detection baselines. CenterFormer achieves state-of-the-art performance for a single model on the Waymo Open Dataset, with 73.7% mAPH on the validation set and 75.6% mAPH on the test set, significantly outperforming all previously published CNN and transformer-based methods. Our code is publicly available at https://github.com/TuSimple/centerformer.
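The selection of center candidates from a heatmap, and their use as transformer queries, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `select_center_queries` and `cross_attention`, the tensor shapes, the single-head attention without learned projections, and the random inputs are all assumptions made for illustration only.

```python
import numpy as np

def select_center_queries(heatmap, features, k):
    """Pick the k highest-scoring heatmap cells and return their features.

    heatmap:  (H, W) center scores (hypothetical encoder output)
    features: (H, W, C) bird's-eye-view feature map (hypothetical)
    """
    h, w = heatmap.shape
    # Indices of the k largest scores, ordered from highest to lowest.
    top = np.argsort(heatmap.ravel())[::-1][:k]
    ys, xs = np.unravel_index(top, (h, w))
    queries = features[ys, xs]                  # (k, C) query embeddings
    centers = np.stack([ys, xs], axis=1)        # (k, 2) grid coordinates
    return queries, centers

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention, no learned projections."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

# Toy usage with random data in place of a real encoder output.
rng = np.random.default_rng(0)
H, W, C, K = 8, 8, 16, 4
heatmap = rng.random((H, W))
features = rng.standard_normal((H, W, C))

queries, centers = select_center_queries(heatmap, features, K)
tokens = features.reshape(-1, C)                # flatten the map into H*W tokens
out, weights = cross_attention(queries, tokens, tokens)
```

Because only K center features act as queries, the attention matrix in this sketch has shape (K, H·W) rather than (H·W, H·W), which is one way to read the abstract's claim of reduced computational complexity.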