Recently, the zero-shot semantic segmentation problem has attracted increasing attention, and the best-performing methods are based on two-stream networks: one stream for proposal mask generation and the other for segment classification using a pre-trained visual-language model. However, existing two-stream methods require forwarding a large number of image crops (up to a hundred) through the visual-language model, which is highly inefficient. To address this problem, we propose a network that needs only a single pass through the visual-language model for each input image. Specifically, we first propose a novel network adaptation approach, termed patch severance, to restrict the harmful interference between the patch embeddings in the pre-trained visual encoder. We then propose classification anchor learning to encourage the network to spatially focus on more discriminative features for classification. Extensive experiments demonstrate that the proposed method achieves outstanding performance, surpassing state-of-the-art methods while being 4 to 7 times faster at inference. We release our code at https://github.com/CongHan0808/DeOP.git.
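The abstract does not spell out how patch severance restricts inter-patch interference, but one plausible form is masking patch-to-patch self-attention inside a frozen ViT-style encoder so that each patch token attends only to itself and the [CLS] token. The sketch below illustrates that masking idea on a single attention head; the function name, the exact mask pattern, and the [CLS]-at-index-0 convention are all illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def severed_attention(q, k, v, sever=True):
    """Single-head self-attention over a [CLS] token plus patch tokens.

    With sever=True, each patch token is blocked from attending to other
    patch tokens (it may still attend to itself and to [CLS]), which is one
    plausible reading of "patch severance": limiting interference between
    patch embeddings in a pre-trained visual encoder. Token 0 is assumed
    to be [CLS]; tokens 1..n-1 are patches. Hypothetical sketch only.
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)          # (n, n) attention logits
    if sever:
        mask = np.full((n, n), -np.inf)    # block everything by default
        mask[0, :] = 0.0                   # [CLS] attends everywhere
        np.fill_diagonal(mask, 0.0)        # each token attends to itself
        mask[:, 0] = 0.0                   # every token may attend to [CLS]
        scores = scores + mask
    # row-wise softmax; -inf logits become exactly zero weight
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v
```

Using one-hot value vectors makes the effect visible: with severance on, a patch token's output mixes only its own value and the [CLS] value, so cross-patch contributions are exactly zero.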