In this paper, we address the semantic segmentation problem with a focus on the context aggregation strategy. Our motivation is that the label of a pixel is the category of the object that the pixel belongs to. We present a simple yet effective approach, object-contextual representations, characterizing a pixel by exploiting the representation of the corresponding object class. First, we learn object regions under the supervision of the ground-truth segmentation. Second, we compute the representation of each object region by aggregating the representations of the pixels lying in that region. Last, we compute the relation between each pixel and each object region, and augment the representation of each pixel with the object-contextual representation, which is a weighted aggregation of all the object region representations according to their relations with the pixel. We empirically demonstrate that the proposed approach achieves competitive performance on various challenging semantic segmentation benchmarks: Cityscapes, ADE20K, LIP, PASCAL-Context, and COCO-Stuff. Our submission "HRNet + OCR + SegFix" achieved first place on the Cityscapes leaderboard at the time of submission. Code is available at: https://git.io/openseg and https://git.io/HRNet.OCR. We also rephrase the object-contextual representation scheme using the Transformer encoder-decoder framework; the details are presented in Section 3.3.
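The three steps above can be sketched as follows. This is a simplified, hypothetical NumPy illustration of the computation, not the paper's implementation: the actual method applies learned transform functions (e.g., 1×1 convolutions) before and after each aggregation, and operates on feature maps rather than flattened pixel matrices.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ocr(pixels, region_logits):
    """Object-contextual representation sketch.

    pixels:        (N, C) pixel representations.
    region_logits: (K, N) coarse soft segmentation scores, one row per
                   object region (supervised by ground-truth segmentation).
    Returns:       (N, 2C) pixel representations augmented with the
                   object-contextual representation.
    """
    # Step 1-2: object region representations as a spatial softmax-weighted
    # aggregation of the pixels belonging to each region.
    weights = softmax(region_logits, axis=1)          # (K, N)
    regions = weights @ pixels                        # (K, C)
    # Step 3a: relation between each pixel and each object region,
    # normalized over regions.
    relation = softmax(pixels @ regions.T, axis=1)    # (N, K)
    # Step 3b: object-contextual representation as a relation-weighted
    # aggregation of all object region representations.
    context = relation @ regions                      # (N, C)
    # Augment each pixel representation with its object context.
    return np.concatenate([pixels, context], axis=1)  # (N, 2C)
```

In this reading, step 3a plays the role of the attention weights in the Transformer rephrasing: the object region representations act as keys/values queried by the pixel representations.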