3D object detection with surround-view images is an essential task for autonomous driving. In this work, we propose DETR4D, a Transformer-based framework that explores sparse attention and direct feature query for 3D object detection in multi-view images. We design a novel projective cross-attention mechanism for query-image interaction to address the limitations of existing methods in terms of geometric cue exploitation and information loss for cross-view objects. In addition, we introduce a heatmap generation technique that bridges 3D and 2D spaces efficiently via query initialization. Furthermore, unlike the common practice of fusing intermediate spatial features for temporal aggregation, we provide a new perspective by introducing a novel hybrid approach that performs cross-frame fusion over past object queries and image features, enabling efficient and robust modeling of temporal information. Extensive experiments on the nuScenes dataset demonstrate the effectiveness and efficiency of the proposed DETR4D.
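The core idea behind the projective cross-attention described above can be sketched as follows: each object query carries a 3D reference point, which is projected into each camera view through that camera's extrinsic and intrinsic matrices, and image features are then gathered at the projected locations. This is a minimal illustrative sketch, not the paper's implementation; all function names are hypothetical, and the real mechanism would use learned attention weights, bilinear sampling, and multi-scale features rather than the nearest-neighbor gather used here for simplicity.

```python
import numpy as np

def project_points(ref_points, extrinsic, intrinsic):
    """Project 3D reference points (N, 3) into one camera's image plane.

    extrinsic: (3, 4) world-to-camera transform; intrinsic: (3, 3) pinhole matrix.
    Returns pixel coordinates (N, 2) and a validity mask (point in front of camera).
    """
    pts_h = np.concatenate([ref_points, np.ones((len(ref_points), 1))], axis=1)
    cam = (extrinsic @ pts_h.T).T            # points in camera coordinates
    valid = cam[:, 2] > 1e-5                 # keep only points in front of the camera
    uvw = (intrinsic @ cam.T).T
    uv = uvw[:, :2] / np.clip(uvw[:, 2:3], 1e-5, None)  # perspective divide
    return uv, valid

def gather_projected_features(ref_points, feat, extrinsic, intrinsic):
    """Gather an image feature for each query at its projected location.

    feat: (H, W, C) feature map of one view. Nearest-neighbor lookup stands in
    for the bilinear sampling and attention weighting a full model would use.
    """
    uv, valid = project_points(ref_points, extrinsic, intrinsic)
    h, w, c = feat.shape
    out = np.zeros((len(ref_points), c))
    for i, ((u, v), ok) in enumerate(zip(uv, valid)):
        ui, vi = int(round(u)), int(round(v))
        if ok and 0 <= ui < w and 0 <= vi < h:
            out[i] = feat[vi, ui]            # query attends to this pixel's feature
    return out
```

Because the projection is per camera, a query whose reference point falls outside one view simply receives no feature from it, while cross-view objects can gather features from every camera that sees them.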