Long-term temporal fusion is a crucial but often overlooked technique in camera-based Bird's-Eye-View (BEV) 3D perception. Most existing methods fuse in a parallel manner. While parallel fusion can benefit from long-term information, it suffers from increasing computational and memory overhead as the fusion window grows. Alternatively, BEVFormer adopts a recurrent fusion pipeline so that history information can be efficiently integrated, yet it fails to benefit from longer temporal frames. In this paper, we explore an embarrassingly simple long-term recurrent fusion strategy built upon LSS-based methods and find that it already enjoys the merits of both sides, i.e., rich long-term information and an efficient fusion pipeline. A temporal embedding module is further proposed to improve the model's robustness against occasionally missed frames in practical scenarios. We name this simple but effective fusion pipeline VideoBEV. Experimental results on the nuScenes benchmark show that VideoBEV obtains leading performance on various camera-based 3D perception tasks, including object detection (55.4% mAP and 62.9% NDS), segmentation (48.6% vehicle mIoU), tracking (54.8% AMOTA), and motion prediction (0.80m minADE and 0.463 EPA). Code will be available.
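To illustrate why recurrent fusion keeps cost constant while parallel fusion grows with the window size, here is a minimal, hypothetical sketch (not the VideoBEV implementation): a single running history state is blended with each incoming BEV feature map, so memory is independent of sequence length. The convex blend stands in for the learned fusion module, and ego-motion alignment of the history feature is omitted for brevity.

```python
import numpy as np

def recurrent_bev_fusion(bev_frames, alpha=0.5):
    """Recurrently fuse a stream of BEV feature maps.

    Only one history state is kept, so memory stays constant no matter
    how many frames are fused (unlike parallel fusion, whose cost grows
    linearly with the fusion window size).
    """
    history = None
    for bev in bev_frames:
        if history is None:
            history = bev.copy()
        else:
            # Convex blend as a stand-in for the learned fusion module.
            history = alpha * bev + (1.0 - alpha) * history
    return history

# Toy example: five random single-channel 4x4 BEV "frames".
rng = np.random.default_rng(0)
frames = [rng.standard_normal((1, 4, 4)) for _ in range(5)]
fused = recurrent_bev_fusion(frames)
```

Note that the fused state has the same shape as a single frame regardless of how long the input stream is, which is the key efficiency property the recurrent pipeline exploits.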