Image segmentation is often ambiguous at the level of individual image patches and requires contextual information to reach label consensus. In this paper we introduce Segmenter, a transformer model for semantic segmentation. In contrast to convolution-based approaches, our approach allows modeling global context already at the first layer and throughout the network. We build on the recent Vision Transformer (ViT) and extend it to semantic segmentation. To do so, we rely on the output embeddings corresponding to image patches and obtain class labels from these embeddings with a point-wise linear decoder or a mask transformer decoder. We leverage models pre-trained for image classification and show that we can fine-tune them on the moderately sized datasets available for semantic segmentation. The linear decoder already yields excellent results, but performance can be further improved by a mask transformer generating class masks. We conduct an extensive ablation study of the different parameters; in particular, performance is better for large models and small patch sizes. Segmenter attains excellent results for semantic segmentation: it outperforms the state of the art on the challenging ADE20K dataset and performs on par on Pascal Context and Cityscapes.
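The point-wise linear decoder mentioned above maps each patch embedding to class scores with a single linear projection, then upsamples the patch-level scores back to pixel resolution. A minimal NumPy sketch of this decoding step, with toy shapes and random stand-in values (the variable names and dimensions are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

H = W = 32   # toy image size
P = 8        # patch size -> a 4x4 grid of patches
D = 16       # embedding dimension
K = 3        # number of semantic classes

n_patches = (H // P) * (W // P)
# Stand-in for the ViT output embeddings, one per image patch.
patch_embeddings = rng.standard_normal((n_patches, D))

# Point-wise linear decoder: one shared projection from embedding to class scores.
W_cls = rng.standard_normal((D, K))
patch_logits = patch_embeddings @ W_cls  # (n_patches, K)

# Reshape to the patch grid and upsample each patch's scores to pixel resolution.
grid = patch_logits.reshape(H // P, W // P, K)
pixel_logits = grid.repeat(P, axis=0).repeat(P, axis=1)  # (H, W, K)
seg_map = pixel_logits.argmax(axis=-1)                   # per-pixel class labels
```

The mask transformer decoder replaces this single projection with learnable class embeddings that attend to the patch embeddings, which is what gives the reported improvement over the linear head.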