Side Adapter Network for Open-Vocabulary Semantic Segmentation

This paper presents a new framework, named Side Adapter Network (SAN), for open-vocabulary semantic segmentation with a pre-trained vision-language model. Our approach models the semantic segmentation task as a region recognition problem. A side network is attached to a frozen CLIP model and has two branches: one predicts mask proposals, and the other predicts attention bias, which is applied in the CLIP model to recognize the class of each mask. This decoupled design benefits CLIP in recognizing the class of mask proposals. Since the attached side network can reuse CLIP features, it can be very light. In addition, the entire network can be trained end-to-end, allowing the side network to adapt to the frozen CLIP model, which makes the predicted mask proposals CLIP-aware. Our approach is fast and accurate, and adds only a few trainable parameters. We evaluate our approach on multiple semantic segmentation benchmarks. Our method significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed. We hope our approach will serve as a solid baseline and help ease future research in open-vocabulary semantic segmentation. The code will be available at https://github.com/MendelXu/SAN.
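The two ideas in the abstract, an attention bias added inside an attention layer and segmentation treated as region recognition, can be illustrated with a small sketch. This is not the authors' implementation: the function names `biased_attention` and `segment` are hypothetical, the bias is assumed to be added to the attention logits before the softmax, and the rule for combining mask proposals with class scores (a per-pixel weighted sum followed by an argmax) is an assumption that the abstract does not spell out.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(query, keys, values, bias):
    """Single-query dot-product attention with an additive bias on the logits.

    Assumption: bias[i] is added to the scaled logit for keys[i], so a large
    negative value steers attention away from that position.
    """
    d = len(query)
    logits = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) + b
        for key, b in zip(keys, bias)
    ]
    weights = softmax(logits)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def segment(mask_probs, class_probs):
    """Combine N mask proposals with per-proposal class scores.

    mask_probs[n][p]  : probability that pixel p belongs to proposal n
    class_probs[n][c] : probability that proposal n has class c
    Returns one class index per pixel (assumed aggregation rule).
    """
    num_proposals = len(mask_probs)
    num_pixels = len(mask_probs[0])
    num_classes = len(class_probs[0])
    labels = []
    for p in range(num_pixels):
        scores = [
            sum(mask_probs[n][p] * class_probs[n][c] for n in range(num_proposals))
            for c in range(num_classes)
        ]
        labels.append(scores.index(max(scores)))
    return labels


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]
    values = [[1.0, 0.0], [0.0, 1.0]]
    # A strongly negative bias on the second position restricts attention to the first.
    print(biased_attention([0.0, 0.0], keys, values, [0.0, -1e9]))

    masks = [[0.9, 0.8, 0.1, 0.2], [0.1, 0.2, 0.9, 0.8]]
    classes = [[0.7, 0.3], [0.2, 0.8]]
    print(segment(masks, classes))
```

In this toy setting each mask proposal gets its own bias vector, so the same frozen attention layer can produce a region-specific feature without any change to its weights, which is the kind of reuse the abstract describes.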

