Recently, the Bird's-Eye-View (BEV) representation has gained increasing attention in multi-view 3D object detection, with promising applications in autonomous driving. Although multi-view camera systems can be deployed at low cost, the lack of depth information forces current approaches to rely on large models for good performance. It is therefore essential to improve the efficiency of BEV 3D object detection. Knowledge Distillation (KD) is one of the most practical techniques for training efficient yet accurate models. However, to the best of our knowledge, KD for BEV models remains under-explored. Unlike image classification, BEV 3D object detection approaches are more complicated and consist of several components. In this paper, we propose a unified framework named BEV-LGKD to transfer knowledge in a teacher-student manner. However, directly applying the teacher-student paradigm to BEV features fails to achieve satisfactory results due to the heavy background information in RGB camera features. To solve this problem, we propose to leverage the localization advantage of LiDAR points. Specifically, we transform the LiDAR points into BEV space and generate a foreground mask and a view-dependent mask for the teacher-student paradigm. Note that our method only uses LiDAR points to guide the KD between RGB models. As the quality of depth estimation is crucial for BEV perception, we further introduce depth distillation into our framework. Our unified framework is simple yet effective and achieves a significant performance boost. Code will be released.
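To make the LiDAR-guided masking idea concrete, the following is a minimal sketch of rasterizing LiDAR points into a binary BEV foreground mask and using it to weight a teacher-student feature distillation loss. All function names, the occupancy-based mask construction, and the grid/range parameters are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def lidar_to_bev_mask(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0),
                      grid=(200, 200)):
    """Rasterize LiDAR points (N, 3+) into a binary BEV occupancy mask.

    A cell is foreground (1.0) if at least one LiDAR return falls inside it.
    Grid size and perception range are hypothetical example values.
    """
    xs, ys = points[:, 0], points[:, 1]
    keep = (xs >= x_range[0]) & (xs < x_range[1]) & \
           (ys >= y_range[0]) & (ys < y_range[1])
    xs, ys = xs[keep], ys[keep]
    ix = ((xs - x_range[0]) / (x_range[1] - x_range[0]) * grid[0]).astype(int)
    iy = ((ys - y_range[0]) / (y_range[1] - y_range[0]) * grid[1]).astype(int)
    mask = np.zeros(grid, dtype=np.float32)
    mask[ix, iy] = 1.0
    return mask

def masked_kd_loss(student_feat, teacher_feat, mask, eps=1e-6):
    """Foreground-weighted L2 distillation loss on BEV features (C, H, W).

    Background cells (mask == 0) contribute nothing, so the student is only
    pulled toward the teacher in LiDAR-occupied regions.
    """
    diff = (student_feat - teacher_feat) ** 2
    weighted = diff * mask[None]  # broadcast mask over the channel axis
    return weighted.sum() / (mask.sum() * student_feat.shape[0] + eps)
```

In a real pipeline the mask would typically be downsampled to the BEV feature resolution and possibly softened (e.g. by Gaussian blurring), and the view-dependent mask described above would be combined with it before weighting the loss.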