Over the past decade, convolutional neural networks (CNNs) have become prominent in semantic segmentation. Although CNN models deliver impressive performance, their ability to capture global representations remains limited, which leads to suboptimal results. Transformers have recently achieved great success in NLP tasks, demonstrating their advantage in modeling long-range dependencies. They have also attracted considerable attention from computer vision researchers, who reformulate image processing tasks as sequence-to-sequence prediction, but this reformulation degrades local feature details. In this work, we propose a lightweight real-time semantic segmentation network called LETNet. LETNet effectively combines a U-shaped CNN with a Transformer in a capsule embedding style so that each compensates for the other's deficiencies. In addition, the carefully designed Lightweight Dilated Bottleneck (LDB) module and Feature Enhancement (FE) module both have a positive impact on training from scratch. Extensive experiments on challenging datasets demonstrate that LETNet achieves a superior balance between accuracy and efficiency. Specifically, it contains only 0.95M parameters and 13.6G FLOPs, yet yields 72.8\% mIoU at 120 FPS on the Cityscapes test set and 70.5\% mIoU at 250 FPS on the CamVid test set, using a single RTX 3090 GPU. The source code will be available at https://github.com/IVIPLab/LETNet.
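For readers unfamiliar with the mIoU metric quoted above, the following is a minimal sketch of how mean intersection-over-union can be computed from flattened per-pixel class labels. It is not taken from the LETNet code base: the function name `mean_iou`, its parameters, and the use of 255 as an ignored label are assumptions made for illustration only.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over the classes that occur in pred or target.

    pred and target are flat sequences of integer class labels, one per pixel.
    ignore_index=255 is an assumed convention for unlabeled pixels.
    """
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixel: excluded from the metric
        if p == t:
            # correct pixel: counts once in both intersection and union
            intersection[t] += 1
            union[t] += 1
        else:
            # wrong pixel: enlarges the union of both classes involved
            union[p] += 1
            union[t] += 1
    # classes absent from both prediction and ground truth are skipped
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else float("nan")


# Class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

Benchmark evaluations typically accumulate the intersection and union counts over the entire test set before dividing, rather than averaging per-image scores; the sketch handles that case if all pixels of all images are passed in one call.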