Self-attention is of vital importance in semantic segmentation, as it enables modeling of long-range context, which translates into improved performance. We argue that it is equally important to model short-range context, especially for cases where the regions of interest are small and ambiguous and where there is an imbalance between the semantic classes. To this end, we propose Masked Supervised Learning (MaskSup), an effective single-stage learning paradigm that models both short- and long-range context, capturing the contextual relationships between pixels via random masking. Experimental results demonstrate the competitive performance of MaskSup against strong baselines in both binary and multi-class segmentation tasks on three standard benchmark datasets, particularly in handling ambiguous regions and retaining better segmentation of minority classes, with no added inference cost. In addition to segmenting target regions even when large portions of the input are masked, MaskSup is generic and can be easily integrated into a variety of semantic segmentation methods. We also show that the proposed method is computationally efficient, improving mean intersection-over-union (mIoU) by 10\% while requiring $3\times$ fewer learnable parameters.
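To make the idea of supervising a model on randomly masked inputs concrete, the following is a minimal sketch assuming a PyTorch setup. The patch size, masking ratio, weighting factor `alpha`, and the way the masked and unmasked branches are combined are illustrative assumptions, not the exact MaskSup formulation.

```python
# Illustrative sketch: random patch masking with supervised segmentation loss.
# Assumption: `model` maps (B, C, H, W) images to (B, num_classes, H, W) logits.
import torch
import torch.nn.functional as F


def random_patch_mask(images, patch=16, mask_ratio=0.5):
    """Zero out a random subset of non-overlapping patches in each image."""
    b, _, h, w = images.shape
    gh, gw = h // patch, w // patch
    keep = (torch.rand(b, 1, gh, gw, device=images.device) > mask_ratio).float()
    keep = F.interpolate(keep, size=(h, w), mode="nearest")  # patch grid -> pixel grid
    return images * keep


def masked_supervised_loss(model, images, labels, alpha=0.5):
    """Cross-entropy on the full image plus cross-entropy on a masked view.

    `alpha` (assumed here) balances the two branches; both views are
    supervised with the same ground-truth labels in a single stage.
    """
    logits_full = model(images)
    logits_masked = model(random_patch_mask(images))
    loss_full = F.cross_entropy(logits_full, labels)
    loss_masked = F.cross_entropy(logits_masked, labels)
    return (1 - alpha) * loss_full + alpha * loss_masked
```

Because the masking is applied only to the training inputs and the loss is computed against unchanged labels, the segmentation network itself is untouched, which is consistent with the abstract's claim of no added inference cost.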