The dominant multi-camera 3D detection paradigm is based on explicit 3D feature construction, which requires complicated indexing of local image-view features via 3D-to-2D projection. Other methods implicitly introduce geometric positional encoding and perform global attention (e.g., PETR) to build the relationship between image tokens and 3D objects. The 3D-to-2D perspective inconsistency and global attention lead to a weak correlation between foreground tokens and queries, resulting in slow convergence. We propose Focal-PETR, which employs instance-guided supervision and a spatial alignment module to adaptively focus object queries on discriminative foreground regions. Focal-PETR additionally introduces a down-sampling strategy to reduce the computational cost of global attention. Owing to the highly parallelized implementation and the down-sampling strategy, our model, without depth supervision, achieves leading performance on the large-scale nuScenes benchmark and a superior speed of 30 FPS on a single RTX3090 GPU. Extensive experiments show that our method outperforms PETR while requiring 3x fewer training hours. The code will be made publicly available.
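A minimal sketch of the down-sampling idea described above, under assumptions of our own: image tokens are scored by a hypothetical per-token foreground head, only the top-scoring tokens are kept, and the 3D object queries attend globally to that reduced set. The class name, shapes, and keep ratio are illustrative, not the authors' released implementation.

```python
# Illustrative sketch only; names, shapes, and the keep ratio are assumptions.
import torch
import torch.nn as nn


class ForegroundSampledCrossAttention(nn.Module):
    def __init__(self, embed_dim=256, num_heads=8, keep_ratio=0.5):
        super().__init__()
        self.score_head = nn.Linear(embed_dim, 1)  # hypothetical foreground-score head
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, queries, tokens):
        # queries: (B, Nq, C) 3D object queries; tokens: (B, Nt, C) image tokens
        # from all camera views, with positional encoding already added.
        scores = self.score_head(tokens).squeeze(-1)             # (B, Nt)
        num_keep = max(1, int(tokens.shape[1] * self.keep_ratio))
        topk_idx = scores.topk(num_keep, dim=1).indices          # (B, num_keep)
        idx = topk_idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        fg_tokens = tokens.gather(1, idx)                        # (B, num_keep, C)
        # Standard global cross-attention, now over far fewer tokens.
        out, _ = self.cross_attn(queries, fg_tokens, fg_tokens)
        return out


if __name__ == "__main__":
    layer = ForegroundSampledCrossAttention()
    q = torch.randn(2, 900, 256)       # 900 object queries (toy size)
    t = torch.randn(2, 6 * 1500, 256)  # tokens from 6 cameras (toy size)
    print(layer(q, t).shape)           # torch.Size([2, 900, 256])
```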