Semantic segmentation is a fundamental task in visual scene understanding. We focus on the supervised setting, in which ground-truth semantic annotations are available. Leveraging the high regularity of real-world scenes, we propose a method that improves class predictions by learning to selectively exploit information from neighboring pixels. In particular, our method builds on the prior that for each pixel, there exists a seed pixel in its close neighborhood that shares the same semantic class. Motivated by this prior, we design a novel two-head network, named Offset Vector Network (OVeNet), which generates both standard semantic predictions and a dense 2D offset vector field indicating the offset from each pixel to its respective seed pixel; the latter is used to compute an alternative, seed-based semantic prediction. The two predictions are adaptively fused at each pixel using a learnt dense confidence map for the predicted offset vector field. We supervise the offset vectors indirectly, both by optimizing the seed-based prediction and via a novel loss on the confidence map. OVeNet achieves significant performance gains over HRNet and HRNet+OCR, the state-of-the-art baseline architectures on which it is built, on two prominent benchmarks for semantic segmentation of driving scenes, namely Cityscapes and ACDC. Code is available at https://github.com/stamatisalex/OVeNet.
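The abstract describes two mechanisms: resampling the semantic prediction at seed-pixel locations given by the offset field, and confidence-weighted fusion of the standard and seed-based predictions. A minimal PyTorch sketch of these two steps is given below; the function name `fuse_with_offsets`, the tensor shapes, and the use of `grid_sample` for resampling are illustrative assumptions, not the authors' exact implementation (which is available at the repository above).

```python
import torch
import torch.nn.functional as F

def fuse_with_offsets(logits, offsets, confidence):
    """Fuse per-pixel logits with seed-based logits gathered via a 2D offset field.

    logits:     (B, C, H, W) standard semantic logits
    offsets:    (B, 2, H, W) predicted (dx, dy) offsets in pixels, pointing
                from each pixel to its seed pixel (shapes are assumptions)
    confidence: (B, 1, H, W) confidence in the offset field, in [0, 1]
    """
    B, C, H, W = logits.shape

    # Base grid of pixel coordinates.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=logits.device, dtype=logits.dtype),
        torch.arange(W, device=logits.device, dtype=logits.dtype),
        indexing="ij",
    )
    # Add the predicted offsets to obtain seed-pixel coordinates.
    seed_x = xs.unsqueeze(0) + offsets[:, 0]
    seed_y = ys.unsqueeze(0) + offsets[:, 1]

    # Normalize coordinates to [-1, 1], as expected by grid_sample.
    grid = torch.stack(
        (2.0 * seed_x / (W - 1) - 1.0, 2.0 * seed_y / (H - 1) - 1.0),
        dim=-1,
    )  # (B, H, W, 2)

    # Seed-based prediction: resample the logits at the seed locations.
    seed_logits = F.grid_sample(logits, grid, align_corners=True)

    # Adaptive per-pixel fusion of the two predictions.
    return confidence * seed_logits + (1.0 - confidence) * logits
```

Because the resampling is differentiable, gradients from a loss on the fused (seed-based) prediction flow back into the offset head, which matches the abstract's statement that the offset vectors are supervised indirectly.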