We present a novel bird's-eye-view (BEV) detector with perspective supervision, which converges faster and better suits modern image backbones. Existing state-of-the-art BEV detectors are often tied to particular depth-pretrained backbones such as VoVNet, which hinders the synergy between rapidly improving image backbones and BEV detectors. To address this limitation, we focus on easing the optimization of BEV detectors by introducing supervision in perspective space. Specifically, we propose a two-stage BEV detector in which proposals from the perspective head are fed into the bird's-eye-view head for final predictions. To evaluate the effectiveness of our model, we conduct extensive ablation studies on the form of supervision and the generality of the proposed detector. The proposed method is verified with a wide spectrum of traditional and modern image backbones and achieves new SoTA results on the large-scale nuScenes dataset. The code will be released soon.
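The two-stage flow described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and is not taken from the authors' code: the `Proposal` type, the score threshold, the projection of proposal centres onto the BEV plane by dropping the height coordinate, and the weighted sum of the two losses are all assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Proposal:
    """Hypothetical 3D box centre from the perspective head (ego frame, metres)."""
    x: float
    y: float
    z: float
    score: float


def proposals_to_bev_queries(
    proposals: List[Proposal],
    learned_queries: List[Tuple[float, float]],
    score_threshold: float = 0.3,  # assumed value, for illustration only
) -> List[Tuple[float, float]]:
    """Stage 1 -> stage 2 hand-off: keep confident perspective proposals,
    drop their height to get BEV-plane reference points, and append them
    to the BEV head's own learned reference points."""
    kept = [(p.x, p.y) for p in proposals if p.score >= score_threshold]
    return list(learned_queries) + kept


def total_loss(bev_loss: float, perspective_loss: float,
               perspective_weight: float = 1.0) -> float:
    """Perspective supervision as an auxiliary term added to the BEV loss.
    The weighting scheme is an assumption, not the paper's stated formula."""
    return bev_loss + perspective_weight * perspective_loss


if __name__ == "__main__":
    props = [Proposal(1.0, 2.0, 0.5, 0.9), Proposal(3.0, 4.0, 0.0, 0.1)]
    print(proposals_to_bev_queries(props, [(0.0, 0.0)]))
    print(total_loss(2.0, 0.5, perspective_weight=2.0))
```

In this sketch the perspective head contributes in two ways: its loss gives the image backbone a direct training signal, and its confident proposals give the BEV head better starting points than learned queries alone.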