Bird's-eye-view (BEV) representation is a recent perception formulation for autonomous driving that is built on spatial fusion. Temporal fusion has since been introduced into BEV representation with considerable success. In this work, we propose a method that merges spatial and temporal fusion into a single unified mathematical formulation. The unified fusion not only offers a new perspective on BEV fusion but also brings new capabilities. With the proposed unified spatial-temporal fusion, our method supports long-range fusion, which is hard to achieve with conventional BEV methods. Moreover, the BEV fusion in our work is temporally adaptive: the weights of temporal fusion are learnable, whereas conventional methods mainly use fixed, equal weights. The proposed unified fusion also avoids the information loss of conventional BEV fusion methods and makes full use of the features. Extensive experiments and ablation studies on the NuScenes dataset demonstrate the effectiveness of the proposed method, which achieves state-of-the-art performance on the map segmentation task.
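To make the contrast between fixed, equal temporal weights and learnable temporal weights concrete, the following is a minimal sketch in plain Python. It is not the formulation proposed in this work: the function names, the softmax normalization, and the toy feature vectors are all assumptions chosen for illustration, and the frames are assumed to be already aligned to a common BEV grid.

```python
import math

def fuse_equal(frames):
    """Average each feature over T frames with fixed, equal weights (1/T)."""
    t = len(frames)
    return [sum(vals) / t for vals in zip(*frames)]

def fuse_weighted(frames, logits):
    """Fuse frames with softmax-normalized weights.

    `logits` stands in for parameters that would be learned during
    training (hypothetical here; one scalar per time step).
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * v for w, v in zip(weights, vals)) for vals in zip(*frames)]

# Three toy, already-aligned BEV feature vectors, ordered oldest to newest.
frames = [[1.0, 0.0], [2.0, 0.0], [3.0, 6.0]]

equal = fuse_equal(frames)
# With all-zero logits the softmax is uniform, so this matches equal weighting.
uniform = fuse_weighted(frames, logits=[0.0, 0.0, 0.0])
# A large logit on the newest frame shifts the fused feature toward it.
recent = fuse_weighted(frames, logits=[0.0, 0.0, 10.0])
```

In this sketch, equal weighting is the special case in which all logits coincide, so a learnable scheme can recover the conventional behavior while remaining free to emphasize whichever time steps prove most informative.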