In previous deep-learning-based methods, semantic segmentation has been regarded as a static or dynamic per-pixel classification task, \textit{i.e.,} classifying each pixel representation into a specific category. However, these methods focus only on learning better pixel representations or classification kernels while ignoring the structural information of objects, which is critical to the human decision-making mechanism. In this paper, we present a new paradigm for semantic segmentation, named structure-aware extraction. Specifically, it generates the segmentation results via interactions between a set of learnable structure tokens and the image feature, aiming to progressively extract the structural information of each category from the feature. Extensive experiments show that our StructToken outperforms the state-of-the-art on three widely used benchmarks: ADE20K, Cityscapes, and COCO-Stuff-10K.
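The token-feature interaction described above can be sketched as a cross-attention-style affinity between learnable per-category tokens and flattened image features. The following is a minimal, hypothetical illustration (the function name, shapes, and scaled-dot-product form are assumptions, not the paper's exact design), where the token-feature affinity scores are reshaped into per-category score maps:

```python
import numpy as np

rng = np.random.default_rng(0)

def structure_token_scores(feat, tokens):
    """Hypothetical sketch: each of K learnable structure tokens
    (one per category) attends over all H*W feature positions; the
    resulting affinity scores are reshaped into per-class score maps."""
    C, H, W = feat.shape
    x = feat.reshape(C, H * W)            # flatten spatial dims: (C, H*W)
    scores = tokens @ x / np.sqrt(C)      # scaled token-feature affinity: (K, H*W)
    return scores.reshape(-1, H, W)       # per-category score maps: (K, H, W)

feat = rng.standard_normal((64, 32, 32))   # backbone feature map (C, H, W)
tokens = rng.standard_normal((150, 64))    # one token per ADE20K category
maps = structure_token_scores(feat, tokens)
print(maps.shape)  # (150, 32, 32)
```

In practice such an interaction would be stacked and refined over multiple stages so that the structural information of each category is extracted progressively, as the abstract describes.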