Current on-board chips usually have different computing power, which means multiple training processes are needed to adapt the same learning-based algorithm to different chips, consuming substantial computing resources. The situation becomes even worse for 3D perception methods built on large models. Previous vision-centric 3D perception approaches are trained on regular-grid feature maps of fixed resolution, which cannot adapt to other grid scales and thus limits wider deployment. In this paper, we leverage the Polar representation when constructing the BEV feature map from images in order to achieve the goal of training once for multiple deployments. Specifically, features along rays in Polar space can be easily sampled adaptively and projected onto Cartesian feature maps of arbitrary resolution. To further improve the adaptation capability, we let multi-scale contextual information interact to enhance the feature representation. Experiments on a large-scale autonomous driving dataset show that our method outperforms others owing to this property of one training for multiple deployments.
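The core operation described above, resampling a Polar BEV feature map onto a Cartesian grid whose resolution is chosen at deployment time, can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, tensor layout, and the use of PyTorch's `grid_sample` for bilinear resampling are assumptions made for clarity.

```python
# Minimal sketch (assumed, not the paper's code): project a Polar BEV feature
# map onto a Cartesian grid of arbitrary resolution via bilinear sampling.
import torch
import torch.nn.functional as F

def polar_to_cartesian(polar_feat, cart_size, max_range):
    """polar_feat: (B, C, R, A) features over R radius bins and A azimuth bins.
       cart_size:  (H, W) of the target Cartesian BEV grid, chosen at deploy time.
       max_range:  metric radius covered by the Polar feature map."""
    B, C, R, A = polar_feat.shape
    H, W = cart_size
    device = polar_feat.device

    # Metric coordinates of every Cartesian cell centre, ego vehicle at the grid centre.
    ys = torch.linspace(-max_range, max_range, H, device=device)
    xs = torch.linspace(-max_range, max_range, W, device=device)
    y, x = torch.meshgrid(ys, xs, indexing="ij")

    # Convert each Cartesian cell to Polar coordinates and normalise to [-1, 1],
    # the coordinate convention expected by grid_sample.
    r = torch.sqrt(x ** 2 + y ** 2)
    theta = torch.atan2(y, x)                 # azimuth in (-pi, pi]
    r_norm = r / max_range * 2.0 - 1.0        # radius  -> height axis of polar_feat
    a_norm = theta / torch.pi                 # azimuth -> width axis of polar_feat

    # grid[..., 0] indexes the width (azimuth), grid[..., 1] the height (radius).
    grid = torch.stack((a_norm, r_norm), dim=-1).unsqueeze(0).expand(B, -1, -1, -1)

    # Cells outside the covered radius fall out of [-1, 1] and are zero-padded.
    return F.grid_sample(polar_feat, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)

# Example usage: one trained Polar map, two deployment resolutions.
feat = torch.randn(2, 64, 128, 256)                 # 128 radius bins, 256 azimuth bins
bev_hi = polar_to_cartesian(feat, (200, 200), 51.2) # (2, 64, 200, 200)
bev_lo = polar_to_cartesian(feat, (100, 100), 51.2) # (2, 64, 100, 100)
```

Because the Polar features are defined along rays, the same trained map can be queried at any Cartesian resolution without retraining, which is the motivation for the one-training, multiple-deployment setting.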