Recently, pure camera-based Bird's-Eye-View (BEV) perception has removed the need for expensive LiDAR sensors, making it a feasible solution for economical autonomous driving. However, most existing BEV solutions either suffer from modest performance or require considerable resources for on-vehicle inference. This paper proposes a simple yet effective framework, termed Fast-BEV, which is capable of performing real-time BEV perception on on-vehicle chips. Towards this goal, we first empirically find that the BEV representation can be sufficiently powerful without an expensive view transformation or depth representation. Starting from the M2BEV baseline, we further introduce (1) a strong data augmentation strategy in both image and BEV space to avoid over-fitting, (2) a multi-frame feature fusion mechanism to leverage temporal information, and (3) an optimized, deployment-friendly view transformation to speed up inference. Through experiments, we show that the Fast-BEV model family achieves considerable accuracy and efficiency on edge devices. In particular, our M1 model (R18@256x704) runs at over 50 FPS on the Tesla T4 platform with 47.0% NDS on the nuScenes validation set. Our largest model (R101@900x1600) establishes a new state-of-the-art 53.5% NDS on the nuScenes validation set. The code is released at: https://github.com/Sense-GVT/Fast-BEV.
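To make the "deployment-friendly view transformation without depth representation" concrete, the following is a minimal illustrative sketch (not the released Fast-BEV code): each BEV voxel center is projected once into a camera with fixed intrinsics/extrinsics, the resulting pixel indices are cached as a lookup table, and inference then reduces to a feature gather with no per-frame depth estimation. All function names, tensor shapes, and camera parameters below are assumptions for illustration.

```python
# Illustrative sketch of a depth-free, lookup-table-based view transformation
# in the spirit of Fast-BEV / M2BEV. Names and shapes are hypothetical.
import torch


def build_projection_lut(voxel_centers, cam_to_img, img_h, img_w):
    """Precompute, for each 3D voxel center, the pixel it falls on in one camera.

    voxel_centers: (N, 3) ego-frame voxel centers.
    cam_to_img:    (3, 4) combined extrinsic + intrinsic projection matrix.
    Returns flat pixel indices (N,) and a validity mask (N,).
    """
    ones = torch.ones(voxel_centers.shape[0], 1)
    pts = torch.cat([voxel_centers, ones], dim=1)        # (N, 4) homogeneous points
    uvz = pts @ cam_to_img.T                              # (N, 3) image-plane coords
    z = uvz[:, 2].clamp(min=1e-5)
    u = (uvz[:, 0] / z).round().long()
    v = (uvz[:, 1] / z).round().long()
    valid = (uvz[:, 2] > 0) & (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    flat_idx = (v * img_w + u).clamp(0, img_h * img_w - 1)
    return flat_idx, valid


def view_transform(img_feat, flat_idx, valid):
    """Lift image features into the voxel volume by a single gather.

    img_feat: (C, H, W) per-camera feature map.
    Returns:  (N, C) voxel features (zeros where the voxel is not visible).
    """
    C, H, W = img_feat.shape
    flat = img_feat.reshape(C, H * W)                     # (C, H*W)
    vox = flat[:, flat_idx].T.clone()                     # (N, C) gathered features
    vox[~valid] = 0.0                                     # mask voxels outside the view
    return vox


# Toy usage: a 10x10x4 voxel grid and one dummy camera looking along ego x.
if __name__ == "__main__":
    xs, ys, zs = torch.meshgrid(
        torch.linspace(1, 20, 10), torch.linspace(-5, 5, 10),
        torch.linspace(-1, 2, 4), indexing="ij")
    centers = torch.stack([xs, ys, zs], dim=-1).reshape(-1, 3)
    # Hypothetical projection: ego x-axis is camera depth, simple pinhole intrinsics.
    P = torch.tensor([[176.0, -200.0,    0.0, 0.0],
                      [128.0,    0.0, -200.0, 0.0],
                      [  1.0,    0.0,    0.0, 0.0]])
    lut, mask = build_projection_lut(centers, P, img_h=256, img_w=352)
    feat = torch.randn(64, 256, 352)
    bev_voxels = view_transform(feat, lut, mask)          # (400, 64)
    print(bev_voxels.shape, mask.float().mean().item())
```

Because the lookup table depends only on the voxel grid and camera calibration, it can be built once offline, which is what makes this style of view transformation cheap and easy to deploy on on-vehicle chips.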