This paper presents a new framework for open-vocabulary semantic segmentation with a pre-trained vision-language model, named Side Adapter Network (SAN). Our approach models the semantic segmentation task as a region recognition problem. A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias, which is applied in the CLIP model to recognize the class of each mask. This decoupled design lets CLIP itself recognize the class of the mask proposals. Since the attached side network can reuse CLIP features, it can be very light. In addition, the entire network can be trained end-to-end, allowing the side network to adapt to the frozen CLIP model, which makes the predicted mask proposals CLIP-aware. Our approach is fast and accurate, and adds only a few trainable parameters. We evaluate our approach on multiple semantic segmentation benchmarks. Our method significantly outperforms its counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference. We hope our approach will serve as a solid baseline and help ease future research in open-vocabulary semantic segmentation. The code will be available at https://github.com/MendelXu/SAN.
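To make the role of the attention bias concrete, the following toy sketch shows how an additive bias on attention logits can steer a query token toward the patch tokens of one mask proposal. It is an illustration only, not the released SAN code: the function names `softmax` and `biased_attention`, the single-head formulation, the two-dimensional features, and the bias values are all assumptions chosen for readability.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(query, keys, values, bias):
    """Single-head attention for one query token.

    `bias` holds one additive term per key (hypothetical stand-in for the
    attention bias predicted by a side network). It is added to the scaled
    dot-product logits before the softmax.
    """
    scale = 1.0 / math.sqrt(len(query))
    logits = [
        scale * sum(q * k for q, k in zip(query, key)) + b
        for key, b in zip(keys, bias)
    ]
    weights = softmax(logits)
    dim = len(values[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return pooled, weights


# Four toy "patch tokens": the first two belong to region A, the last two to region B.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = keys
query = [0.0, 0.0]  # uninformative query, so the bias alone decides the weights

bias_a = [5.0, 5.0, -5.0, -5.0]  # assumed bias for a proposal covering region A
bias_b = [-5.0, -5.0, 5.0, 5.0]  # assumed bias for a proposal covering region B

feat_a, w_a = biased_attention(query, keys, values, bias_a)
feat_b, w_b = biased_attention(query, keys, values, bias_b)
```

With the same frozen keys and values, changing only the bias changes which region the pooled feature describes. In this toy setting, that is how a frozen recognition model could be guided to classify one mask proposal at a time without updating its weights.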