Fully supervised semantic segmentation technologies have brought a paradigm shift in scene understanding. However, the burden of expensive labeling costs remains a challenge. To address this cost problem, recent studies proposed language-model-based zero-shot semantic segmentation (L-ZSSS) approaches. In this paper, we argue that L-ZSSS has a limitation in generalization, which is a key virtue of zero-shot learning. To tackle this limitation, we propose a language-model-free zero-shot semantic segmentation framework, the Spatial and Multi-scale aware Visual Class Embedding Network (SM-VCENet). Leveraging vision-oriented class embeddings, SM-VCENet enriches the visual information of the class embedding through multi-scale attention and spatial attention. We also propose a novel benchmark (PASCAL2COCO) for zero-shot semantic segmentation, which evaluates generalization through domain adaptation and contains visually challenging samples. In experiments, our SM-VCENet outperforms the zero-shot semantic segmentation state of the art by a relative margin on the PASCAL-5i benchmark and shows robust generalization on the PASCAL2COCO benchmark.
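The abstract names the core mechanism (a vision-oriented class embedding refined by multi-scale and spatial attention) without detailing it. The following is only a minimal, hypothetical PyTorch sketch of how such a module could look; the class name, the support-feature/mask inputs, the pooling scales, and the fusion scheme are all assumptions for illustration, not the paper's actual SM-VCENet architecture.

```python
# Hypothetical sketch: refine a visual class embedding with spatial attention
# and multi-scale pooling. All names and design choices are assumptions made
# for illustration; they are not taken from the SM-VCENet paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialMultiScaleClassEmbedding(nn.Module):
    def __init__(self, channels, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        # Spatial attention: 1x1 conv producing a per-pixel weight map.
        self.spatial_attn = nn.Conv2d(channels, 1, kernel_size=1)
        # Learned weights for fusing the multi-scale pooled features.
        self.scale_logits = nn.Parameter(torch.zeros(len(scales)))

    def forward(self, support_feat, support_mask):
        # support_feat: (B, C, H, W) features of a support image
        # support_mask: (B, 1, H, W) binary mask of the target class
        masked = support_feat * support_mask
        # Spatial attention over the masked support features.
        attn = torch.sigmoid(self.spatial_attn(masked))
        weighted = masked * attn
        # Multi-scale average pooling followed by a learned weighted fusion.
        scale_w = torch.softmax(self.scale_logits, dim=0)
        embedding = torch.zeros(support_feat.size(0), support_feat.size(1),
                                device=support_feat.device)
        for w, s in zip(scale_w, self.scales):
            pooled = F.adaptive_avg_pool2d(weighted, s)      # (B, C, s, s)
            embedding = embedding + w * pooled.mean(dim=(2, 3))  # (B, C)
        return embedding  # vision-oriented class embedding, shape (B, C)
```

Under these assumptions, the module replaces a text-encoder class embedding with one computed purely from visual support features, which is the language-model-free property the abstract emphasizes.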