Recently, transformer-based networks have shown impressive results in semantic segmentation. Yet for real-time semantic segmentation, pure CNN-based approaches still dominate this field, due to the time-consuming computation mechanism of transformers. We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation, which achieves a better trade-off between performance and efficiency than CNN-based models. To achieve high inference efficiency on GPU-like devices, our RTFormer leverages GPU-Friendly Attention with linear complexity and discards the multi-head mechanism. Besides, we find that cross-resolution attention is more efficient at gathering global context information for the high-resolution branch, by spreading the high-level knowledge learned in the low-resolution branch. Extensive experiments on mainstream benchmarks demonstrate the effectiveness of the proposed RTFormer: it achieves state-of-the-art performance on Cityscapes, CamVid and COCOStuff, and shows promising results on ADE20K. Code is available at PaddleSeg: https://github.com/PaddlePaddle/PaddleSeg.
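Since the abstract only names the mechanism, the following is a minimal sketch of what a single-head, linear-complexity attention can look like, in the spirit of external attention with a small set of learnable key/value tokens. The class name, token count, and the particular double-normalization variant are illustrative assumptions, not the authors' implementation; see the PaddleSeg repository above for the official code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GPUFriendlyAttentionSketch(nn.Module):
    # Single-head attention against a small set of learnable external
    # key/value tokens: cost is O(N * M) in the number of spatial
    # positions N (with M << N), rather than O(N^2) self-attention.
    def __init__(self, dim: int, num_tokens: int = 128):
        super().__init__()
        # Learnable external key/value memories, each (M, C).
        self.keys = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.values = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) flattened spatial features; a single head only.
        attn = x @ self.keys.t()                  # (B, N, M), linear in N
        attn = F.softmax(attn, dim=-1)            # normalize over tokens
        attn = attn / (attn.sum(1, keepdim=True) + 1e-6)  # double-norm variant
        return attn @ self.values                 # (B, N, C)

# Usage: a 32x64 feature map with 128 channels, flattened to tokens.
feats = torch.randn(2, 32 * 64, 128)
out = GPUFriendlyAttentionSketch(dim=128)(feats)
print(out.shape)  # torch.Size([2, 2048, 128])
```

Because the key/value set has a fixed size, the attention map never grows quadratically with input resolution, which is what makes this style of attention attractive for high-resolution, real-time inference on GPU-like devices.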