3D object detection from multiple image views is a fundamental and challenging task for visual scene understanding. Owing to its low cost and high efficiency, multi-view 3D object detection has shown promising application prospects. However, accurately detecting objects from perspective views is extremely difficult due to the lack of depth information. Current approaches tend to adopt heavy backbones for image encoders, making them impractical for real-world deployment. Unlike images, LiDAR points are superior in providing spatial cues, enabling highly precise localization. In this paper, we explore incorporating LiDAR-based detectors into multi-view 3D object detection. Instead of directly training a depth prediction network, we unify the image and LiDAR features in the Bird's-Eye-View (BEV) space and adaptively transfer knowledge across non-homogeneous representations in a teacher-student paradigm. To this end, we propose \textbf{BEVDistill}, a cross-modal BEV knowledge distillation (KD) framework for multi-view 3D object detection. Extensive experiments demonstrate that the proposed method outperforms current KD approaches on a highly competitive baseline, BEVFormer, without introducing any extra cost in the inference phase. Notably, our best model reaches 59.4 NDS on the nuScenes test leaderboard, setting a new state of the art among various image-based detectors. Code will be available at https://github.com/zehuichen123/BEVDistill.
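To make the teacher-student transfer concrete, below is a minimal sketch (in PyTorch) of a masked dense BEV feature-distillation loss. This is not the authors' released implementation: the class name, the 1x1 channel adapter, and the soft foreground mask are illustrative assumptions about how a LiDAR teacher's BEV features could supervise a camera student's BEV features on a shared grid.

\begin{verbatim}
import torch
import torch.nn as nn
import torch.nn.functional as F

class BEVFeatureDistillLoss(nn.Module):
    """Hypothetical dense BEV feature-distillation loss.

    Assumes the LiDAR teacher and camera student BEV maps are
    spatially aligned (same H x W grid). A 1x1 conv adapts the
    student channel dimension to the teacher's, and a soft
    foreground mask re-weights the per-cell error so that object
    regions dominate the transfer.
    """

    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        self.adapter = nn.Conv2d(student_channels, teacher_channels,
                                 kernel_size=1)

    def forward(self, student_bev, teacher_bev, fg_mask):
        # student_bev: (B, C_s, H, W) camera BEV features
        # teacher_bev: (B, C_t, H, W) LiDAR BEV features
        # fg_mask:     (B, 1, H, W) soft foreground weights in [0, 1]
        aligned = self.adapter(student_bev)
        # Teacher is frozen: detach so no gradient flows into it.
        per_cell = F.mse_loss(aligned, teacher_bev.detach(),
                              reduction="none").mean(dim=1, keepdim=True)
        # Normalize by total foreground weight for a stable loss scale.
        return (per_cell * fg_mask).sum() / fg_mask.sum().clamp(min=1.0)
\end{verbatim}

During training, such a term would be added to the standard detection loss with a weighting coefficient; at inference only the camera student runs, which is why distillation adds no deployment cost.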