When trained at sufficient scale, self-supervised learning has shown a notable ability to solve a wide range of visual and language understanding tasks. In this paper, we investigate simple yet effective approaches for adapting pre-trained foundation models to a downstream task of interest, namely open-vocabulary semantic segmentation. To this end, we make the following contributions: (i) we introduce Fusioner, which uses a lightweight, transformer-based fusion module to pair frozen visual representations with language concepts using only a handful of image segmentation data; as a consequence, the model gains the capability of zero-shot transfer to segment novel categories; (ii) without loss of generality, we experiment on a broad range of self-supervised models pre-trained with different schemes, e.g. visual-only models (MoCo v3, DINO), language-only models (BERT), and a visual-language model (CLIP), and show that the proposed fusion approach is effective for any pair of visual and language models, even those pre-trained on uni-modal corpora; (iii) we conduct thorough ablation studies to analyze the critical components of Fusioner, and on standard benchmarks, e.g. PASCAL-5i and COCO-20i, it surpasses existing state-of-the-art models by a large margin, despite being trained only on frozen visual and language features; (iv) to measure the model's robustness in learning visual-language correspondence, we further evaluate on a synthetic dataset, named Mosaic-4, in which images are constructed by mosaicking samples from FSS-1000. Fusioner demonstrates superior performance over previous models.
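To make the Mosaic-4 construction concrete, the following is a minimal sketch of how four image/mask pairs could be tiled into one synthetic sample. The abstract only states that images are built by mosaicking FSS-1000 samples; the 2x2 grid layout, the function name `make_mosaic`, and the label encoding (0 for background, k + 1 for the object from the k-th tile) are assumptions made here for illustration, not details confirmed by the paper.

```python
import numpy as np


def make_mosaic(images, masks):
    """Tile four (H, W, 3) images and their (H, W) binary masks into a 2x2 grid.

    Hypothetical helper: the 2x2 layout and the label encoding are assumptions.
    Returns the mosaic image and a label map where 0 is background and
    k + 1 marks foreground pixels of the k-th input tile.
    """
    if len(images) != 4 or len(masks) != 4:
        raise ValueError("expected exactly four images and four masks")
    h, w = masks[0].shape
    mosaic = np.zeros((2 * h, 2 * w, 3), dtype=images[0].dtype)
    labels = np.zeros((2 * h, 2 * w), dtype=np.int64)
    for k, (img, msk) in enumerate(zip(images, masks)):
        if img.shape != (h, w, 3) or msk.shape != (h, w):
            raise ValueError("all tiles must share the same spatial size")
        r, c = divmod(k, 2)  # tile k goes to grid row r, column c
        rows = slice(r * h, (r + 1) * h)
        cols = slice(c * w, (c + 1) * w)
        mosaic[rows, cols] = img
        labels[rows, cols] = np.where(msk > 0, k + 1, 0)
    return mosaic, labels
```

Under these assumptions, each mosaic contains objects from several categories at once, so a model has to match each category name to the correct region rather than segment a single salient object.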