Bird's-Eye-View (BEV) semantic maps have become an essential component of automated driving pipelines due to the rich representation they provide for decision-making tasks. However, existing approaches for generating these maps still follow a fully supervised training paradigm and hence rely on large amounts of annotated BEV data. In this work, we address this limitation by proposing the first self-supervised approach for generating a BEV semantic map using a single monocular image from the frontal view (FV). During training, we overcome the need for BEV ground truth annotations by leveraging the more easily available FV semantic annotations of video sequences. Thus, we propose the SkyEye architecture that learns based on two modes of self-supervision, namely, implicit supervision and explicit supervision. Implicit supervision trains the model by enforcing spatial consistency of the scene over time based on FV semantic sequences, while explicit supervision exploits BEV pseudolabels generated from FV semantic annotations and self-supervised depth estimates. Extensive evaluations on the KITTI-360 dataset demonstrate that our self-supervised approach performs on par with the state-of-the-art fully supervised methods and achieves competitive results using only 1% of direct supervision in the BEV compared to fully supervised approaches. Finally, we publicly release both our code and the BEV datasets generated from the KITTI-360 and Waymo datasets.