Autonomous navigation requires scene understanding of the action-space to move or anticipate events. For planner agents moving on the ground plane, such as autonomous vehicles, this translates to scene understanding in the bird's-eye view (BEV). However, the onboard cameras of autonomous cars are customarily mounted horizontally for a better view of the surroundings. In this work, we study scene understanding in the form of online estimation of semantic BEV maps using the video input from a single onboard camera. We study three key aspects of this task: image-level understanding, BEV-level understanding, and the aggregation of temporal information. Based on these three pillars, we propose a novel architecture that combines them. In our extensive experiments, we demonstrate that the considered aspects are complementary to each other for BEV understanding. Furthermore, the proposed architecture significantly surpasses the current state of the art. Code: https://github.com/ybarancan/BEV_feat_stitch.
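To make the three aspects concrete, the following is a minimal sketch of the general idea behind lifting per-frame image features to the ground plane and aggregating them over time; it is not the authors' implementation. The camera intrinsics, extrinsics, grid extents, and the assumption that features from different frames are already expressed in a common ego frame (ego-motion warping omitted for brevity) are all hypothetical choices made for illustration.

```python
import numpy as np

def bev_grid(x_range=(0.0, 40.0), y_range=(-20.0, 20.0), resolution=0.5):
    """Ground-plane (z = 0) sample points of shape (H, W, 3) in the ego frame."""
    xs = np.arange(*x_range, resolution)            # forward direction
    ys = np.arange(*y_range, resolution)            # lateral direction
    gx, gy = np.meshgrid(xs, ys, indexing='ij')     # (H, W)
    return np.stack([gx, gy, np.zeros_like(gx)], axis=-1)

def image_to_bev(feat, K, R, t, grid):
    """Sample image features (C, h, w) at the pixels where the BEV grid projects.

    K: 3x3 intrinsics, R, t: ego-to-camera rotation and translation (assumed known).
    Returns the BEV feature map (C, H, W) and a validity mask (H, W).
    """
    H, W, _ = grid.shape
    pts_cam = (R @ grid.reshape(-1, 3).T) + t[:, None]   # ego frame -> camera frame
    uvw = K @ pts_cam
    u = uvw[0] / np.clip(uvw[2], 1e-6, None)
    v = uvw[1] / np.clip(uvw[2], 1e-6, None)
    valid = (uvw[2] > 0) & (u >= 0) & (u < feat.shape[2]) & (v >= 0) & (v < feat.shape[1])
    bev = np.zeros((feat.shape[0], H * W), dtype=feat.dtype)
    ui, vi = u[valid].astype(int), v[valid].astype(int)   # nearest-neighbour sampling
    bev[:, valid] = feat[:, vi, ui]
    return bev.reshape(feat.shape[0], H, W), valid.reshape(H, W)

def aggregate(bev_feats, masks):
    """Average temporally 'stitched' BEV features where at least one frame is valid."""
    stack = np.stack(bev_feats)                           # (T, C, H, W)
    weights = np.stack(masks)[:, None].astype(stack.dtype)
    return (stack * weights).sum(0) / np.clip(weights.sum(0), 1e-6, None)
```

In this sketch, image-level understanding corresponds to the per-frame feature map `feat`, BEV-level understanding operates on the warped output of `image_to_bev`, and temporal aggregation is the simple validity-weighted average in `aggregate`; the actual architecture learns these components rather than using fixed sampling and averaging.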