Recent advances in pixel-level tasks (e.g., segmentation) illustrate the benefit of long-range interactions between aggregated region-based representations, which can enhance local features. However, such pixel-to-region associations and the resulting representations, which often take the form of attention, cannot model the underlying semantic structure of the scene (e.g., individual objects and, by extension, their interactions). In this work, we take a step toward addressing this limitation. Specifically, we propose an architecture in which we learn to project image features into latent region representations and perform global reasoning across them, using a transformer, to produce contextualized and scene-consistent representations that are then fused with the original pixel-level features. Our design enables the latent regions to represent semantically meaningful concepts by ensuring that activated regions are spatially disjoint and that unions of such regions correspond to connected object segments. The resulting semantic global reasoning (SGR) is end-to-end trainable and can be combined with any semantic segmentation framework and backbone. Combining SGR with DeepLabV3 yields semantic segmentation performance that is competitive with the state of the art, while producing more semantically interpretable and diverse region representations, which we show can effectively transfer to detection and instance segmentation. Further, we propose a new metric that allows us to measure the semantics of representations at both the object-class and instance level.
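The project, reason, and fuse pipeline described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `region_reasoning`, the weight matrices (`w_assign`, `w_q`, `w_k`, `w_v`), the softmax-based soft assignment, and the single-head self-attention (standing in for the transformer) are all assumptions made for illustration. The constraints that make regions spatially disjoint and connected are not modeled here.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_reasoning(feats, w_assign, w_q, w_k, w_v):
    """Hypothetical sketch: pixels -> latent regions -> attention -> pixels.

    feats:    (H, W, C) pixel-level features
    w_assign: (C, K) projection producing K region assignment logits per pixel
    w_q/k/v:  (C, C) projections for single-head self-attention over regions
    """
    h, w, c = feats.shape
    x = feats.reshape(h * w, c)                                    # (N, C)

    # 1) Softly assign each pixel to K latent regions (assumed softmax form).
    assign = softmax(x @ w_assign, axis=1)                         # (N, K)

    # 2) Aggregate pixel features into region representations
    #    (assignment-weighted mean of the pixel features).
    regions = (assign.T @ x) / (assign.sum(axis=0)[:, None] + 1e-8)  # (K, C)

    # 3) Global reasoning across regions; one self-attention layer with a
    #    residual connection stands in for the transformer.
    q, k, v = regions @ w_q, regions @ w_k, regions @ w_v
    attn = softmax(q @ k.T / np.sqrt(c), axis=1)                   # (K, K)
    regions_ctx = regions + attn @ v                               # (K, C)

    # 4) Project contextualized regions back to pixels and fuse them with
    #    the original features through a residual sum.
    fused = x + assign @ regions_ctx                               # (N, C)
    return fused.reshape(h, w, c), assign.reshape(h, w, -1)

# Toy usage with random features and weights (illustrative sizes only).
rng = np.random.default_rng(0)
H, W, C, K = 4, 5, 8, 3
feats = rng.standard_normal((H, W, C))
w_assign = rng.standard_normal((C, K))
w_q, w_k, w_v = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
fused, assign = region_reasoning(feats, w_assign, w_q, w_k, w_v)
print(fused.shape, assign.shape)
```

In this sketch the fused output keeps the spatial shape of the input features, so it could in principle feed any segmentation head unchanged; the per-pixel assignment map is returned as well because it is what one would inspect to judge how interpretable the regions are.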