Dense panoptic prediction is a key ingredient in many existing applications such as autonomous driving, automated warehouses or remote sensing. Many of these applications require fast inference over large input resolutions on affordable or even embedded hardware. We propose to achieve this goal by trading off backbone capacity for multi-scale feature extraction. In comparison with contemporaneous approaches to panoptic segmentation, the main novelties of our method are efficient scale-equivariant feature extraction, cross-scale upsampling through pyramidal fusion and boundary-aware learning of pixel-to-instance assignment. The proposed method is very well suited for remote sensing imagery due to the huge number of pixels in typical city-wide and region-wide datasets. We present panoptic experiments on Cityscapes, Vistas, COCO and the BSB-Aerial dataset. Our models outperform the state of the art on the BSB-Aerial dataset while being able to process more than a hundred 1MPx images per second on an RTX3090 GPU with FP16 precision and TensorRT optimization.
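The following is a minimal PyTorch sketch (not the authors' implementation) of the core idea stated above: a compact backbone with shared weights is applied to an image pyramid, yielding scale-equivariant features, which a decoder then fuses across scales while upsampling. All module names, channel counts, and the toy backbone are illustrative assumptions.

```python
# Minimal sketch of shared-backbone pyramid feature extraction with
# cross-scale fusion during upsampling. Names and sizes are assumptions,
# not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidalFusionDecoder(nn.Module):
    """Fuses backbone features extracted at multiple input scales."""

    def __init__(self, backbone: nn.Module, feat_ch: int, num_classes: int):
        super().__init__()
        self.backbone = backbone              # shared across all scales
        self.proj = nn.Conv2d(feat_ch, 128, kernel_size=1)
        self.head = nn.Conv2d(128, num_classes, kernel_size=1)

    def forward(self, image: torch.Tensor, num_levels: int = 3):
        # Build the image pyramid and extract features with shared weights
        # (the same filters see objects at different apparent sizes).
        feats = []
        for level in range(num_levels):
            scaled = F.interpolate(image, scale_factor=0.5 ** level,
                                   mode='bilinear', align_corners=False)
            feats.append(self.proj(self.backbone(scaled)))
        # Cross-scale upsampling: start from the coarsest level and
        # repeatedly upsample, summing with the next finer level.
        fused = feats[-1]
        for finer in reversed(feats[:-1]):
            fused = F.interpolate(fused, size=finer.shape[-2:],
                                  mode='bilinear', align_corners=False) + finer
        # Upsample the fused map back to input resolution for dense logits.
        logits = F.interpolate(self.head(fused), size=image.shape[-2:],
                               mode='bilinear', align_corners=False)
        return logits


# Usage with a toy single-conv "backbone"; a real model would plug in a
# compact pretrained network here.
backbone = nn.Conv2d(3, 64, kernel_size=3, stride=4, padding=1)
model = PyramidalFusionDecoder(backbone, feat_ch=64, num_classes=19)
out = model(torch.randn(1, 3, 256, 512))   # -> (1, 19, 256, 512)
```

Sharing backbone weights across pyramid levels is what trades capacity for multi-scale coverage: the same filters respond to objects at different apparent sizes, so a smaller network can span a wide scale range.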