Detection and tracking of moving objects (DATMO) is an essential component of environmental perception for autonomous driving. In the flourishing field of multi-view, camera-based 3D detectors, various transformer-based pipelines have been designed to learn queries in 3D space from 2D feature maps of perspective views, but the dominant dense cross-attention mechanism between queries and values is computationally inefficient. This paper proposes Sparse R-CNN 3D (SRCN3D), a novel two-stage, fully sparse detector with sparse queries, sparse attention and sparse prediction for surround-view camera detection and tracking. SRCN3D adopts a cascade structure with a twin-track update of both a fixed number of proposal boxes and latent proposal features. Compared to prior art, our novel sparse feature sampling module uses only local 2D region of interest (RoI) features, obtained by projecting the 3D proposal boxes onto the images, for further box refinement, leading to an effective, fast and lightweight pipeline. For multi-object tracking, motion features, proposal features and RoI features are jointly used in multi-hypothesis data association. Extensive experiments on the nuScenes dataset demonstrate that SRCN3D achieves competitive performance in object detection and surpasses the best previously reported camera-only multi-object tracking results (as of 2022.08.09) by more than 10 points in AMOTA. Code is available at https://github.com/synsin0/SRCN3D.
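To illustrate the projection step behind the sparse feature sampling, the following is a minimal sketch of how a 3D proposal box might be mapped to a 2D RoI in one camera view. It is not taken from the SRCN3D code base: the function names `box_corners` and `project_box_to_roi`, the 3x4 `projection` matrix, and the axis-aligned box (yaw ignored) are all simplifying assumptions made here for illustration.

```python
import numpy as np


def box_corners(center, size):
    """Return the 8 corners of an axis-aligned 3D box (yaw ignored: an assumption)."""
    signs = np.array(
        [[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)],
        dtype=float,
    )
    return np.asarray(center, dtype=float) + 0.5 * signs * np.asarray(size, dtype=float)


def project_box_to_roi(center, size, projection, image_wh):
    """Project a 3D box into one camera and return its 2D RoI (x1, y1, x2, y2).

    `projection` is a hypothetical 3x4 matrix mapping homogeneous 3D points
    to homogeneous pixel coordinates. Returns None if the box is not visible.
    """
    corners = box_corners(center, size)                  # (8, 3)
    homo = np.hstack([corners, np.ones((8, 1))])         # (8, 4)
    uvw = homo @ projection.T                            # (8, 3)

    # Keep only corners in front of the camera (a simplification).
    in_front = uvw[:, 2] > 1e-6
    if not in_front.any():
        return None
    uv = uvw[in_front, :2] / uvw[in_front, 2:3]

    # Enclosing rectangle of the projected corners, clipped to the image.
    width, height = image_wh
    x1, y1 = uv.min(axis=0)
    x2, y2 = uv.max(axis=0)
    x1, x2 = np.clip([x1, x2], 0, width)
    y1, y2 = np.clip([y1, y2], 0, height)
    if x2 <= x1 or y2 <= y1:
        return None
    return float(x1), float(y1), float(x2), float(y2)


if __name__ == "__main__":
    # Toy pinhole camera at the origin looking along +z (assumed values).
    intrinsics = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
    projection = intrinsics @ np.hstack([np.eye(3), np.zeros((3, 1))])
    print(project_box_to_roi((0.0, 0.0, 10.0), (2.0, 2.0, 2.0), projection, (100, 100)))
```

In a surround-view setting this projection would be repeated for each camera, and features would be pooled only inside the resulting RoIs (for example with an RoI-align operator) instead of attending densely over every feature map.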