In 3D object detection for autonomous driving, sensor configurations spanning single-modality and multi-modality setups are diverse and complex. Multi-modal methods incur high system complexity, while single-modal ones are relatively less accurate, making the trade-off between the two difficult. In this work, we propose a universal cross-modality knowledge distillation framework (UniDistill) to improve the performance of single-modality detectors. Specifically, during training, UniDistill projects the features of both the teacher and the student detector into Bird's-Eye-View (BEV), a representation that is friendly to different modalities. Three distillation losses are then computed to sparsely align the foreground features, helping the student learn from the teacher without introducing additional cost at inference. Taking advantage of the similar detection paradigm that different detectors share in BEV, UniDistill readily supports LiDAR-to-camera, camera-to-LiDAR, fusion-to-LiDAR, and fusion-to-camera distillation paths. Furthermore, the three distillation losses filter out the effect of misaligned background information and balance between objects of different sizes, improving the distillation effectiveness. Extensive experiments on nuScenes demonstrate that UniDistill effectively improves the mAP and NDS of student detectors by 2.0%–3.2%.
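To make the core idea concrete, below is a minimal PyTorch sketch of one of the three losses: a feature-level distillation term that is restricted to foreground BEV cells and normalized by foreground area. The function name `foreground_feature_distill`, the mask rasterization step, and the per-cell normalization scheme are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def foreground_feature_distill(student_bev: torch.Tensor,
                               teacher_bev: torch.Tensor,
                               fg_mask: torch.Tensor) -> torch.Tensor:
    """Sparse feature-level distillation on BEV feature maps (illustrative sketch).

    student_bev, teacher_bev: (B, C, H, W) BEV features; both detectors project
        their modality-specific features into this shared BEV representation.
    fg_mask: (B, 1, H, W) binary mask of foreground cells, e.g. rasterized
        from ground-truth boxes (hypothetical preprocessing step).
    """
    # Per-cell squared error between student and teacher BEV features,
    # averaged over channels. We assume channel dims already match; in
    # practice a 1x1 conv adapter could align them.
    diff = F.mse_loss(student_bev, teacher_bev, reduction="none")  # (B, C, H, W)
    diff = diff.mean(dim=1, keepdim=True)                          # (B, 1, H, W)

    # Keep only foreground cells, so misaligned background information
    # between modalities does not pollute the distillation signal, and
    # normalize by the foreground area so large objects (many cells) do
    # not dominate small ones.
    return (diff * fg_mask).sum() / fg_mask.sum().clamp(min=1.0)
```

Because the loss operates purely on the shared BEV representation, swapping the teacher (LiDAR, camera, or fusion) leaves this code unchanged, which is what makes the four distillation paths interchangeable.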