Detecting 3D objects from multi-view images is a fundamental problem in 3D computer vision. Recently, significant breakthroughs have been made in multi-view 3D detection tasks. However, the unprecedented detection performance of these vision-only BEV (bird's-eye-view) detection models comes with enormous parameter counts and computation costs, which make them unaffordable on edge devices. To address this problem, in this paper we propose a structured knowledge distillation framework that aims to improve the efficiency of modern vision-only BEV detection models. The proposed framework mainly includes: (a) spatial-temporal distillation, which distills the teacher's knowledge of fusing information from different timestamps and views; (b) BEV response distillation, which distills the teacher's responses to different pillars; and (c) weight inheriting, which solves the problem of inconsistent inputs between student and teacher in modern transformer architectures. Experimental results show that our method yields an average improvement of 2.16 mAP and 2.27 NDS on the nuScenes benchmark, outperforming multiple baselines by a large margin.
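The BEV response distillation component described above can be illustrated with a minimal sketch: soften the teacher's and student's per-pillar responses with a temperature, then penalize their divergence. The function name, tensor shapes, and the specific KL-based formulation below are illustrative assumptions, not the paper's actual loss.

```python
import numpy as np

def bev_response_distillation_loss(student_bev, teacher_bev, temperature=2.0):
    """Hypothetical sketch of a response-distillation loss.

    student_bev / teacher_bev: arrays of shape (num_pillars, num_channels),
    holding each model's raw response per BEV pillar (shapes are assumptions).
    Returns the mean KL divergence between temperature-softened
    teacher and student response distributions.
    """
    def softmax(x, t):
        # Temperature-scaled, numerically stable softmax over channels.
        z = x / t
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    p_teacher = softmax(teacher_bev, temperature)
    p_student = softmax(student_bev, temperature)
    # KL(teacher || student), averaged over pillars; epsilon guards log(0).
    kl = (p_teacher * (np.log(p_teacher + 1e-8)
                       - np.log(p_student + 1e-8))).sum(axis=-1)
    return float(kl.mean())
```

When the student matches the teacher exactly, the loss is zero; any mismatch in the softened pillar responses produces a positive penalty, which gradient descent on the student can then reduce.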