Recent advances in semantic segmentation generally adopt an ImageNet-pretrained backbone and append a special context module after it to rapidly increase the field-of-view. Although successful, this design means the backbone, where most of the computation lies, does not have a large enough field-of-view to make the best decisions. Some recent approaches tackle this problem by rapidly downsampling the resolution in the backbone while also maintaining one or more parallel branches at higher resolutions. We take a different approach: we design a ResNeXt-inspired block structure that uses two parallel 3x3 convolutional layers with different dilation rates to increase the field-of-view while preserving local details. By repeating this block structure throughout the backbone, we do not need to append any special context module after it. In addition, we propose a lightweight decoder that restores local information better than common alternatives. To demonstrate the effectiveness of our approach, our model RegSeg achieves state-of-the-art results on the real-time Cityscapes and CamVid benchmarks. Using a T4 GPU with mixed precision, RegSeg achieves 78.3 mIOU on the Cityscapes test set at 30 FPS, and 80.9 mIOU on the CamVid test set at 70 FPS, both without ImageNet pretraining.
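The core idea above — that 3x3 convolutions with larger dilation rates enlarge the field-of-view without extra parameters, while a parallel dilation-1 branch preserves local detail — can be illustrated with a small receptive-field calculator. This is a sketch for intuition, not code from the paper; the function name and the `(stride, dilation)` layer encoding are our own.

```python
# Hypothetical receptive-field calculator (not from the RegSeg paper).
# Shows how stacking 3x3 convs with growing dilation rates expands the
# field-of-view far faster than plain 3x3 convs of the same depth.
def receptive_field(layers, kernel=3):
    """layers: list of (stride, dilation) tuples, applied in order.
    Returns the receptive field size (in input pixels) of one output pixel."""
    rf, jump = 1, 1  # jump = cumulative stride (output-pixel spacing in the input)
    for stride, dilation in layers:
        # An effective kernel of size (kernel - 1) * dilation + 1 widens the
        # receptive field by its extent, scaled by the cumulative stride.
        rf += (kernel - 1) * dilation * jump
        jump *= stride
    return rf

# Four plain 3x3 convs vs. four 3x3 convs with dilation rates 1, 2, 4, 8:
plain = receptive_field([(1, 1)] * 4)                         # -> 9
dilated = receptive_field([(1, 1), (1, 2), (1, 4), (1, 8)])   # -> 31
print(plain, dilated)
```

In a block with two parallel 3x3 branches, the dilation-1 branch keeps the 3x3 local footprint while the larger-dilation branch contributes the expanded context, which is why repeating such blocks can replace a dedicated context module.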