Bird's-Eye-View (BEV) maps have emerged as one of the most powerful representations for scene understanding due to their ability to provide rich spatial context while being easy to interpret and process. Such maps have found use in many real-world tasks that rely extensively on accurate scene segmentation as well as object instance identification in the BEV space. However, existing segmentation algorithms only predict semantics in the BEV space, which limits their use in applications where the notion of object instances is also critical. In this work, we present the first BEV panoptic segmentation approach that directly predicts dense panoptic segmentation maps in the BEV, given a single monocular image in the frontal view (FV). Our architecture follows the top-down paradigm and incorporates a novel dense transformer module consisting of two distinct transformers that learn to independently map vertical and flat regions of the input image from the FV to the BEV. Additionally, we derive a mathematical formulation for the sensitivity of the FV-BEV transformation, which allows us to intelligently weight pixels in the BEV space to account for the varying descriptiveness across the FV image. Extensive evaluations on the KITTI-360 and nuScenes datasets demonstrate that our approach exceeds the state of the art in the PQ metric by 3.61 pp and 4.93 pp, respectively.
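To make the sensitivity idea concrete, the following is a minimal sketch under an assumed flat-ground pinhole camera model; it is an illustration of the general principle, not the paper's exact formulation, and the function name and parameters f (focal length in pixels), h (camera height above the ground plane), and c_v (image row of the horizon) are assumptions introduced here. Under this model, an image row v below the horizon maps to ground depth z(v) = f·h / (v − c_v), so the sensitivity |dz/dv| = f·h / (v − c_v)² grows quadratically with depth: a single FV pixel near the horizon covers a much larger BEV area than one near the bottom of the image, and BEV cells can be weighted accordingly.

```python
import numpy as np

def bev_sensitivity_weights(num_rows, f, h, c_v, eps=1e-6):
    """Per-row sensitivity of a flat-ground FV->BEV mapping (illustrative).

    An image row v below the horizon maps to ground depth
    z(v) = f * h / (v - c_v), so the sensitivity
    |dz/dv| = f * h / (v - c_v)**2 grows quadratically with depth.
    BEV cells supported by few FV pixels get proportionally lower weight.
    All parameter names are hypothetical, not from the paper.
    """
    v = np.arange(num_rows, dtype=np.float64)
    dv = np.maximum(v - c_v, eps)        # clamp rows at/above the horizon
    z = f * h / dv                        # ground depth hit by each image row
    sensitivity = f * h / dv**2           # |dz/dv| per row
    weights = 1.0 / sensitivity           # fewer FV pixels -> lower weight
    return z, weights / weights.max()     # depths and weights normalised to [0, 1]

# Example: a 900-row image, f = 1000 px, camera 1.5 m above ground, horizon at row 450.
depths, w = bev_sensitivity_weights(900, f=1000.0, h=1.5, c_v=450.0)
```

In practice, such per-cell weights could modulate a dense loss in the BEV space, down-weighting distant cells whose appearance is described by only a handful of FV pixels.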