The bird's-eye-view (BEV) grid is a common representation for perceiving road components, e.g., drivable area, in autonomous driving. Most existing approaches rely solely on cameras to perform segmentation in BEV space, which is fundamentally constrained by the absence of reliable depth information. Recent works leverage both camera and LiDAR modalities, but fuse their features sub-optimally using simple, concatenation-based mechanisms. In this paper, we address these problems by enhancing the alignment of the unimodal features to aid feature fusion, as well as the alignment between the cameras' perspective view (PV) and BEV representations. We propose X-Align, a novel end-to-end cross-modal and cross-view learning framework for BEV segmentation consisting of the following components: (i) a novel Cross-Modal Feature Alignment (X-FA) loss, (ii) an attention-based Cross-Modal Feature Fusion (X-FF) module to align multi-modal BEV features implicitly, and (iii) an auxiliary PV segmentation branch with Cross-View Segmentation Alignment (X-SA) losses to improve the PV-to-BEV transformation. We evaluate our proposed method on two commonly used benchmark datasets, i.e., nuScenes and KITTI-360. Notably, X-Align significantly outperforms the state of the art by 3 absolute mIoU points on nuScenes. We also provide extensive ablation studies to demonstrate the effectiveness of the individual components.
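
To make the idea of a cross-modal feature alignment loss concrete, the following is a minimal sketch and not the X-FA loss itself. The abstract does not specify the form of the loss, so this sketch assumes a penalty of one minus the mean cosine similarity between camera and LiDAR features at matching BEV cells. The names `cosine_similarity` and `alignment_loss`, and the flat list-of-vectors layout, are hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # eps guards against division by zero for all-zero feature vectors.
    return dot / (max(norm_u, eps) * max(norm_v, eps))


def alignment_loss(cam_bev, lidar_bev):
    """Assumed alignment penalty: 1 - mean cosine similarity over BEV cells.

    cam_bev and lidar_bev are lists of per-cell feature vectors, one entry
    per BEV grid cell, in the same cell order (a hypothetical layout).
    """
    if not cam_bev or len(cam_bev) != len(lidar_bev):
        raise ValueError("both feature maps must be non-empty and the same size")
    sims = [cosine_similarity(c, l) for c, l in zip(cam_bev, lidar_bev)]
    return 1.0 - sum(sims) / len(sims)


# Toy example with two BEV cells and two feature channels.
cam = [[1.0, 0.0], [0.0, 2.0]]
lidar_aligned = [[2.0, 0.0], [0.0, 1.0]]     # same directions, different scale
lidar_orthogonal = [[0.0, 1.0], [1.0, 0.0]]  # orthogonal directions

loss_aligned = alignment_loss(cam, lidar_aligned)
loss_orthogonal = alignment_loss(cam, lidar_orthogonal)
```

Under this assumption the penalty is zero when the two modalities agree in direction at every cell, regardless of feature magnitude, and grows as their features diverge. In a training setup such a term would typically be added to the segmentation loss with a weighting factor.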