Autonomous driving requires a structured understanding of the surrounding road network to navigate. One of the most common and useful representations of such an understanding is the BEV lane graph. In this work, we use the video stream from an onboard camera for online extraction of the surrounding lane graph. Using video, instead of a single image, as input brings both benefits and challenges in combining information from different timesteps. We study these challenges with three different approaches. The first approach is a post-processing step that merges single-frame lane graph estimates into a unified lane graph. The second approach uses spatio-temporal embeddings in the transformer, letting the network discover the best temporal aggregation strategy. Finally, the third, which is the proposed method, performs early temporal aggregation through explicit BEV projection and alignment of framewise features. A single model of this simple yet effective method can process any number of images, including one, to produce accurate lane graphs. Experiments on the nuScenes and Argoverse datasets show the validity of all the approaches while highlighting the superiority of the proposed method. The code will be made public.
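To make the third approach concrete, the following is a minimal sketch (not the authors' implementation) of early temporal aggregation in BEV: per-frame BEV feature maps are warped into the current ego frame using the known ego-motion and then fused, here by simple averaging. The grid extent, feature shapes, SE(2) pose format, and the mean fusion operator are all illustrative assumptions.

```python
# Hypothetical sketch of BEV feature alignment and temporal fusion.
# Assumes a square BEV grid of +/- bev_extent metres mapped to [-1, 1]
# normalized coordinates, and SE(2) ego-motion (dx, dy, yaw) per frame.
import math

import torch
import torch.nn.functional as F


def warp_bev(feat, rel_pose, bev_extent=50.0):
    """Warp a BEV feature map (B, C, H, W) from a past ego frame into the
    current ego frame. rel_pose = (dx, dy, yaw) is the pose of the past
    frame relative to the current one, in metres and radians."""
    b = feat.shape[0]
    dx, dy, yaw = rel_pose
    cos, sin = math.cos(yaw), math.sin(yaw)
    # Affine map from current-frame normalized coords to past-frame coords.
    theta = torch.tensor(
        [[cos, -sin, dx / bev_extent],
         [sin,  cos, dy / bev_extent]],
        dtype=feat.dtype, device=feat.device,
    ).unsqueeze(0).repeat(b, 1, 1)
    grid = F.affine_grid(theta, list(feat.shape), align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)


def aggregate(bev_feats, rel_poses):
    """Align each frame's BEV features to the newest frame and average them.
    Works for any number of frames, including a single one."""
    aligned = [warp_bev(f, p) for f, p in zip(bev_feats, rel_poses)]
    return torch.stack(aligned).mean(dim=0)


if __name__ == "__main__":
    # Three synthetic BEV feature maps with made-up ego-motion between frames.
    frames = [torch.randn(1, 64, 100, 100) for _ in range(3)]
    poses = [(2.0 * i, 0.0, 0.05 * i) for i in range(2, -1, -1)]
    fused = aggregate(frames, poses)
    print(fused.shape)  # torch.Size([1, 64, 100, 100])
```

Because the aggregation is a permutation-invariant reduction over aligned frames, the same trained model can in principle consume a variable-length history at inference time, which matches the single-model, any-number-of-images property described above.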