3D object detection is an essential task in autonomous driving. Recently, with the progress of vision transformers, the 2D object detection problem has been reformulated with a set-to-set loss. Inspired by these 2D object detection approaches and by DETR3D, a multi-view 3D object detection method, we propose MSF3DDETR: a Multi-Sensor Fusion 3D Detection Transformer architecture that fuses image and LiDAR features to improve detection accuracy. Our end-to-end, single-stage, anchor-free and NMS-free network takes multi-view images and LiDAR point clouds as input and predicts 3D bounding boxes. First, we link the object queries learnt from data to the image and LiDAR features through a novel MSF3DDETR cross-attention block. Second, the object queries interact with each other in a multi-head self-attention block. Finally, the MSF3DDETR block is repeated $L$ times to refine the object queries. The MSF3DDETR network is trained end-to-end on the nuScenes dataset using Hungarian-algorithm-based bipartite matching and a set-to-set loss inspired by DETR. We present both quantitative and qualitative results that are competitive with state-of-the-art approaches.
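The decoder structure described above (cross-attention of object queries to image and LiDAR features, followed by query self-attention, stacked $L$ times) can be sketched as follows. This is a minimal single-head NumPy illustration only: the projection matrices, layer norms, positional encodings, and the actual fusion design of MSF3DDETR are omitted, and the summation-based fusion of the two modalities is an assumption made here for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention, single head for brevity.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def msf3ddetr_block(queries, img_feats, lidar_feats):
    # Hypothetical fusion: cross-attend the queries to each modality
    # separately, then combine with a residual sum (an assumption; the
    # paper's cross-attention block may fuse modalities differently).
    q_img = attention(queries, img_feats, img_feats)
    q_lidar = attention(queries, lidar_feats, lidar_feats)
    queries = queries + q_img + q_lidar
    # Queries interact with each other via self-attention.
    queries = queries + attention(queries, queries, queries)
    return queries

rng = np.random.default_rng(0)
d, n_queries, L = 64, 300, 6                 # illustrative sizes
queries = rng.normal(size=(n_queries, d))    # learnt object queries
img_feats = rng.normal(size=(1000, d))       # flattened multi-view image features
lidar_feats = rng.normal(size=(500, d))      # flattened LiDAR features

for _ in range(L):  # L stacked blocks progressively refine the queries
    queries = msf3ddetr_block(queries, img_feats, lidar_feats)
print(queries.shape)  # (300, 64): one refined embedding per object query
```

Each refined query embedding would then be decoded into a 3D box and class score by small prediction heads.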
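The bipartite matching used for the set-to-set loss assigns each prediction to at most one ground-truth box so that the total matching cost is minimal. A tiny sketch of that idea, using brute-force enumeration instead of the Hungarian algorithm (exact only for small instances; the cost values here are made up for illustration):

```python
from itertools import permutations
import numpy as np

# Hypothetical pairwise matching costs: rows = predictions,
# columns = ground-truth boxes (e.g. a mix of class and box terms).
cost = np.array([[0.90, 0.10, 0.80],
                 [0.20, 0.70, 0.60],
                 [0.50, 0.40, 0.05]])

def min_cost_assignment(cost):
    # Exact minimum-cost one-to-one assignment by enumerating all
    # permutations. Fine for a 3x3 toy example; real implementations
    # use the Hungarian algorithm (O(n^3)) instead.
    n = cost.shape[0]
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i, p[i]] for i in range(n)))
    return list(best)

print(min_cost_assignment(cost))  # [1, 0, 2]: prediction 0 -> GT 1, etc.
```

The set-to-set loss is then computed only over these matched pairs, with unmatched predictions supervised toward a "no object" class, as in DETR.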