Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models

When trained at sufficient scale, self-supervised learning has shown a notable ability to solve a wide range of visual and language understanding tasks. In this paper, we investigate simple yet effective approaches for adapting pre-trained foundation models to a downstream task of interest, namely open-vocabulary semantic segmentation. To this end, we make the following contributions: (i) we introduce Fusioner, which uses a lightweight, transformer-based fusion module to pair frozen visual representations with language concepts using only a handful of image segmentation data. As a consequence, the model gains the capability of zero-shot transfer to segment novel categories; (ii) without loss of generality, we experiment on a broad range of self-supervised models pre-trained with different schemes, e.g. visual-only models (MoCo v3, DINO), language-only models (BERT), and a visual-language model (CLIP), and show that the proposed fusion approach is effective for any pair of visual and language models, even those pre-trained on a corpus of uni-modal data; (iii) we conduct thorough ablation studies to analyze the critical components of Fusioner. When evaluated on standard benchmarks, e.g. PASCAL-5i and COCO-20i, it surpasses existing state-of-the-art models by a large margin, despite being trained only on frozen visual and language features; (iv) to measure the model's robustness in learning visual-language correspondence, we further evaluate on a synthetic dataset, named Mosaic-4, in which images are constructed by mosaicking samples from FSS-1000. Fusioner demonstrates superior performance over previous models.
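The mosaicking step behind Mosaic-4 can be illustrated with a small sketch. The abstract only states that images are built by mosaicking FSS-1000 samples, so the details here are assumptions made for illustration: four equally sized tiles, a 2x2 layout, binary per-sample masks, and `0` as the background label. The helper names `make_mosaic` and `make_mosaic_labels` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence

Grid = List[List[int]]  # row-major 2-D array of integers


def make_mosaic(tiles: Sequence[Grid]) -> Grid:
    """Arrange four equally sized tiles in a 2x2 grid (assumed layout).

    Tile order is: top-left, top-right, bottom-left, bottom-right.
    """
    if len(tiles) != 4:
        raise ValueError("expected exactly four tiles")
    height, width = len(tiles[0]), len(tiles[0][0])
    for tile in tiles:
        if len(tile) != height or any(len(row) != width for row in tile):
            raise ValueError("all tiles must share one shape")
    # Concatenate rows horizontally, then stack the two halves vertically.
    top = [left + right for left, right in zip(tiles[0], tiles[1])]
    bottom = [left + right for left, right in zip(tiles[2], tiles[3])]
    return top + bottom


def make_mosaic_labels(masks: Sequence[Grid], class_ids: Sequence[int]) -> Grid:
    """Turn four binary masks into one multi-class label map.

    Foreground pixels of each mask receive that sample's class id;
    everything else stays 0 (assumed background label).
    """
    if len(masks) != len(class_ids):
        raise ValueError("need one class id per mask")
    labelled = [
        [[class_id if pixel else 0 for pixel in row] for row in mask]
        for mask, class_id in zip(masks, class_ids)
    ]
    return make_mosaic(labelled)


if __name__ == "__main__":
    # Four toy 2x2 binary masks, one foreground pixel each.
    toy_masks = [
        [[1, 0], [0, 0]],
        [[0, 1], [0, 0]],
        [[0, 0], [1, 0]],
        [[0, 0], [0, 1]],
    ]
    for row in make_mosaic_labels(toy_masks, class_ids=[1, 2, 3, 4]):
        print(row)
```

The same `make_mosaic` helper would apply to single-channel image tiles as well as to label maps, which keeps each mosaicked image and its ground truth aligned. A mosaic of this kind places several categories in one image, so a model has to match each text query to the correct region.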

