This paper studies the context aggregation problem in semantic image segmentation. Existing research focuses on improving pixel representations by aggregating contextual information within individual images. Though impressive, these methods neglect the representations of same-class pixels beyond the input image. To address this, this paper proposes to mine the contextual information beyond individual images to further augment the pixel representations. We first set up a feature memory module, updated dynamically during training, to store the dataset-level representations of the various categories. Then, we learn the class probability distribution of each pixel representation under the supervision of the ground-truth segmentation. Finally, the representation of each pixel is augmented by aggregating the dataset-level representations according to its class probability distribution. Furthermore, utilizing the stored dataset-level representations, we propose a representation consistency learning strategy that helps the classification head better address intra-class compactness and inter-class dispersion. The proposed method can be effortlessly incorporated into existing segmentation frameworks (e.g., FCN, PSPNet, OCRNet, and DeepLabV3) and brings consistent performance improvements. Mining contextual information beyond individual images allows us to report state-of-the-art performance on various benchmarks: ADE20K, LIP, Cityscapes, and COCO-Stuff.
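To make the pipeline concrete, below is a minimal PyTorch sketch of the idea described above, not the authors' implementation: a dataset-level feature memory holds one representation per category, an auxiliary classifier supervised by the ground truth predicts each pixel's class probability distribution, and each pixel aggregates the memory entries weighted by those probabilities before fusion with its original feature. The module name `BeyondImageContext`, the 1x1-conv fusion, and the momentum-based `update_memory` rule are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BeyondImageContext(nn.Module):
    """Sketch: augment pixel features with dataset-level class representations
    stored in a memory that is updated dynamically during training."""

    def __init__(self, num_classes, channels, momentum=0.9):
        super().__init__()
        # dataset-level representation for each category (K, C)
        self.register_buffer("memory", torch.zeros(num_classes, channels))
        self.momentum = momentum  # assumed moving-average update rate
        # auxiliary head, trained with CE loss against the ground truth,
        # yields the per-pixel class probability distribution
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)
        # fuse the original feature with the aggregated dataset-level context
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, feats):                                  # feats: (B, C, H, W)
        logits = self.classifier(feats)                        # (B, K, H, W)
        probs = F.softmax(logits, dim=1)
        B, C, H, W = feats.shape
        p = probs.permute(0, 2, 3, 1).reshape(-1, probs.size(1))   # (B*H*W, K)
        # each pixel aggregates the stored class representations,
        # weighted by its predicted class probability distribution
        ctx = (p @ self.memory).reshape(B, H, W, C).permute(0, 3, 1, 2)
        out = self.fuse(torch.cat([feats, ctx], dim=1))
        return out, logits  # logits are supervised by the segmentation labels

    @torch.no_grad()
    def update_memory(self, feats, labels):
        """Moving-average update of each category's dataset-level representation.
        `labels` is assumed to be resized to the feature resolution (B, H, W)."""
        B, C, H, W = feats.shape
        f = feats.permute(0, 2, 3, 1).reshape(-1, C)
        l = labels.reshape(-1)
        for k in l.unique():
            if k < 0 or k >= self.memory.size(0):  # skip ignore labels
                continue
            mean_k = f[l == k].mean(dim=0)
            self.memory[k] = (self.momentum * self.memory[k]
                              + (1 - self.momentum) * mean_k)
```

Under these assumptions the module drops in after the backbone of any of the frameworks named above: the backbone features pass through `forward`, and `update_memory` is called once per training step so the memory gradually tracks dataset-level class statistics rather than statistics of any single image.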