In recent years, vision-centric perception has flourished across autonomous-driving tasks, including 3D detection, semantic map construction, motion forecasting, and depth estimation. Nevertheless, the latency of vision-centric approaches is too high for practical deployment (e.g., most camera-based 3D detectors have a runtime greater than 300ms). To bridge the gap between idealized research and real-world applications, it is necessary to quantify the trade-off between performance and efficiency. Traditionally, autonomous-driving perception benchmarks perform offline evaluation, neglecting the inference-time delay. To mitigate this problem, we propose the Autonomous-driving StreAming Perception (ASAP) benchmark, the first benchmark to evaluate the online performance of vision-centric perception in autonomous driving. On the basis of the 2Hz-annotated nuScenes dataset, we first propose an annotation-extending pipeline to generate high-frame-rate labels for the 12Hz raw images. With reference to practical deployment, the Streaming Perception Under constRained-computation (SPUR) evaluation protocol is further constructed, where the 12Hz inputs are utilized for streaming evaluation under the constraints of different computational resources. In the ASAP benchmark, comprehensive experimental results reveal that the model ranking changes under different constraints, suggesting that model latency and computation budget should be treated as design choices when optimizing for practical deployment. To facilitate further research, we establish baselines for camera-based streaming 3D detection, which consistently enhance streaming performance across various hardware. ASAP project page: https://github.com/JeffWang987/ASAP.
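The core idea of streaming evaluation described above, scoring a model against the world state at query time rather than at capture time, can be sketched minimally as follows. This is an illustrative example, not the ASAP implementation; the function name and the simplifying assumption of a fixed per-frame latency are hypothetical.

```python
# Hypothetical sketch of streaming evaluation: at each query time, the
# evaluator can only use the most recent prediction whose inference has
# already finished, so a slow model is scored against stale outputs.

def latest_finished_prediction(query_time, frame_times, latency):
    """Return the index of the newest frame whose prediction is ready
    (capture time + inference latency <= query time), or None."""
    best = None
    for i, t in enumerate(frame_times):
        if t + latency <= query_time:
            best = i
    return best

# A 12Hz input stream, as in the SPUR protocol (capture times in seconds).
frame_times = [i / 12.0 for i in range(12)]

# With a 300ms detector, a query at t=0.5s can only be served by the frame
# captured at ~0.167s (index 2); a 10ms detector serves index 5 (~0.417s).
slow = latest_finished_prediction(0.5, frame_times, 0.300)  # -> 2
fast = latest_finished_prediction(0.5, frame_times, 0.010)  # -> 5
```

This simple matching rule is why the model ranking can change under different computational constraints: reducing latency lets the evaluator score fresher predictions, which can outweigh a small drop in per-frame accuracy.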