Inspired by recent advances in vision transformers for object detection, we propose Li3DeTr, an end-to-end LiDAR-based 3D Detection Transformer for autonomous driving, which takes LiDAR point clouds as input and regresses 3D bounding boxes. The LiDAR local and global features are encoded using sparse convolution and multi-scale deformable attention, respectively. In the decoder head, first, the novel Li3DeTr cross-attention block links the LiDAR global features to 3D predictions, leveraging a sparse set of object queries learnt from the data. Second, the object query interactions are formulated using multi-head self-attention. Finally, the decoder layer is repeated $L_{dec}$ times to refine the object queries. Inspired by DETR, we employ a set-to-set loss to train the Li3DeTr network. Without bells and whistles, Li3DeTr achieves 61.3% mAP and 67.6% NDS on the nuScenes dataset, surpassing state-of-the-art methods that use non-maximum suppression (NMS), and it also achieves competitive performance on the KITTI dataset. We additionally employ knowledge distillation (KD) with a teacher-student model, which slightly improves the performance of our network.
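To make the decoder structure concrete, below is a minimal PyTorch-style sketch of one decoder layer and its repetition $L_{dec}$ times. All module names, dimensions, and the use of plain `nn.MultiheadAttention` as a stand-in for the paper's Li3DeTr cross-attention block are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the Li3DeTr decoder head: object queries first
# interact via self-attention, then attend to encoded LiDAR global features
# via cross-attention, and the layer is stacked L_dec times for refinement.
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        # Multi-head self-attention models object query interactions.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Cross-attention links queries to LiDAR global features
        # (standing in for the paper's Li3DeTr cross-attention block).
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, queries, feats):
        q = self.norm1(queries + self.self_attn(queries, queries, queries)[0])
        q = self.norm2(q + self.cross_attn(q, feats, feats)[0])
        return self.norm3(q + self.ffn(q))

# Stacking L_dec layers iteratively refines the same set of object queries,
# which a prediction head would then decode into 3D boxes and class scores.
L_dec, num_queries, d_model = 6, 300, 256   # assumed hyperparameters
layers = nn.ModuleList(DecoderLayer(d_model) for _ in range(L_dec))
queries = torch.zeros(1, num_queries, d_model)   # learnt object queries
feats = torch.randn(1, 1000, d_model)            # encoded LiDAR global features
for layer in layers:
    queries = layer(queries, feats)
```

Because each query directly produces one box prediction matched to a ground-truth object by the set-to-set loss, no NMS post-processing is needed, in contrast to the NMS-based baselines the abstract compares against.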