In recent years, striking a good trade-off between accuracy and inference speed has become the core issue for real-time semantic segmentation, which plays a vital role in real-world applications such as autonomous driving systems and drones. In this study, we devise a novel lightweight network using a multi-scale context fusion scheme (MSCFNet), which adopts an asymmetric encoder-decoder architecture to address this problem. More specifically, the encoder is built from efficient asymmetric residual (EAR) modules, which are composed of factorized depth-wise convolution and dilated convolution. Meanwhile, instead of complicated computation, simple deconvolution is applied in the decoder to further reduce the number of parameters while still maintaining high segmentation accuracy. In addition, MSCFNet has branches with efficient attention modules from different stages of the network to capture multi-scale contextual information. We then combine them before the final classification to enhance the expressiveness of the features and improve the segmentation efficiency. Comprehensive experiments on challenging datasets demonstrate that the proposed MSCFNet, which contains only 1.15M parameters, achieves 71.9\% mean IoU on the Cityscapes test set and runs at over 50 FPS on a single Titan XP GPU.
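To give a rough sense of why factorized depth-wise convolution keeps the parameter count low, and why dilation enlarges the receptive field at no parameter cost, the following is a minimal sketch in plain Python. It is not the EAR module itself: the abstract does not specify the exact layer layout, so the 3x1 followed by 1x3 depth-wise pair, the channel width of 64, and all function names here are illustrative assumptions.

```python
def conv_params(c_in, c_out, kh, kw, groups=1):
    """Number of weights in a 2-D convolution without bias."""
    assert c_in % groups == 0 and c_out % groups == 0
    return (c_in // groups) * c_out * kh * kw


def factorized_depthwise_params(channels):
    """Weights in a 3x1 depth-wise conv followed by a 1x3 depth-wise conv.

    Assumed layout for illustration only; depth-wise means groups == channels.
    """
    vertical = conv_params(channels, channels, 3, 1, groups=channels)
    horizontal = conv_params(channels, channels, 1, 3, groups=channels)
    return vertical + horizontal


def effective_kernel(k, dilation):
    """Spatial extent covered by a k-tap kernel with the given dilation rate."""
    return k + (k - 1) * (dilation - 1)


if __name__ == "__main__":
    channels = 64  # hypothetical width, not taken from the paper
    standard = conv_params(channels, channels, 3, 3)
    factorized = factorized_depthwise_params(channels)
    print(f"standard 3x3 conv:        {standard} weights")
    print(f"factorized depth-wise:    {factorized} weights")
    for rate in (1, 2, 4):
        print(f"dilation {rate}: 3 taps span {effective_kernel(3, rate)} pixels")
```

Under these assumptions the depth-wise pair uses far fewer weights than a dense 3x3 convolution because it does not mix channels; a practical module would still need some channel-mixing step, such as a 1x1 convolution, which this sketch leaves out.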