Semantic segmentation of remotely sensed urban scene images is required in a wide range of practical applications, such as land cover mapping, urban change detection, environmental protection, and economic assessment. Driven by rapid developments in deep learning, convolutional neural networks (CNNs) have dominated semantic segmentation for many years. CNNs adopt hierarchical feature representations and are strong at extracting local information. However, the local nature of the convolution layer limits the network's ability to capture global context. Recently, the Transformer has attracted wide attention in computer vision and has demonstrated great potential in global information modelling, boosting many vision tasks such as image classification, object detection, and particularly semantic segmentation. In this paper, we propose a Transformer-based decoder and construct a UNet-like Transformer (UNetFormer) for real-time urban scene segmentation. For efficient segmentation, the UNetFormer uses the lightweight ResNet18 as the encoder and introduces an efficient global-local attention mechanism in the decoder to model both global and local information. Extensive experiments show that our method is both faster and more accurate than state-of-the-art lightweight models. Specifically, the proposed UNetFormer achieves 67.8% and 52.4% mIoU on the UAVid and LoveDA datasets, respectively, while its inference speed reaches up to 322.4 FPS with a 512×512 input on a single NVIDIA RTX 3090 GPU. In further exploration, the proposed Transformer-based decoder combined with a Swin Transformer encoder also achieves a state-of-the-art result (91.3% F1 and 84.1% mIoU) on the Vaihingen dataset. The source code will be freely available at https://github.com/WangLibo1995/GeoSeg.
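For readers unfamiliar with the two metrics quoted in the abstract, the following is a minimal sketch of how mIoU and mean F1 are commonly computed from a class confusion matrix. It is not taken from the GeoSeg repository; the function names (`confusion_matrix`, `mean_iou`, `mean_f1`) are hypothetical, and the choice to skip classes that never occur in either the labels or the predictions is an assumption. The evaluation protocols of UAVid, LoveDA, and Vaihingen may differ, for example in which classes are ignored.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Build a num_classes x num_classes matrix; rows = true class, cols = predicted class."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def _per_class_counts(cm, c):
    """Return (true positives, false positives, false negatives) for class c."""
    tp = cm[c][c]
    fn = sum(cm[c]) - tp                      # true class c, predicted as something else
    fp = sum(row[c] for row in cm) - tp       # predicted c, actually something else
    return tp, fp, fn


def mean_iou(cm):
    """Mean over classes of IoU = TP / (TP + FP + FN)."""
    scores = []
    for c in range(len(cm)):
        tp, fp, fn = _per_class_counts(cm, c)
        denom = tp + fp + fn
        if denom == 0:
            continue  # assumption: skip classes absent from both labels and predictions
        scores.append(tp / denom)
    return sum(scores) / len(scores)


def mean_f1(cm):
    """Mean over classes of F1 = 2*TP / (2*TP + FP + FN)."""
    scores = []
    for c in range(len(cm)):
        tp, fp, fn = _per_class_counts(cm, c)
        denom = 2 * tp + fp + fn
        if denom == 0:
            continue  # same assumption as in mean_iou
        scores.append(2 * tp / denom)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy example: six flattened pixel labels over three classes.
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0]
    cm = confusion_matrix(y_true, y_pred, num_classes=3)
    print("mIoU:", mean_iou(cm))
    print("mean F1:", mean_f1(cm))
```

In practice the label and prediction maps of a whole test set would be flattened and accumulated into one confusion matrix before the two means are taken, so that per-image class imbalance does not distort the score.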