We introduce compositional soft prompting (CSP), a parameter-efficient learning technique to improve the zero-shot compositionality of large-scale pretrained vision-language models (VLMs). VLMs can represent arbitrary classes as natural language prompts in their flexible text encoders, but they underperform state-of-the-art methods on compositional zero-shot benchmark tasks. To improve VLMs, we propose a novel form of soft prompting. We treat the attributes and objects that are composed to define classes as learnable tokens of vocabulary and tune them on multiple prompt compositions. During inference, we recompose the learned attribute-object vocabulary in new combinations. We show that CSP outperforms the original VLM on benchmark datasets by an average of 10.9 percentage points on AUC. CSP also outperforms CoOp, a soft prompting method that tunes the prefix context, by an average of 5.8 percentage points on AUC. We perform additional experiments to show that CSP improves generalization to attribute-only classification, higher-order attribute-attribute-object compositions, and combinations of pretrained attributes and fine-tuned objects.
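To make the mechanism concrete, below is a minimal PyTorch sketch of the idea the abstract describes: the only trainable parameters are one soft token per attribute and per object, which are composed into a prompt for a frozen text encoder and recomposed into new attribute-object pairs at inference. The class names (`CSPTextSide`, `FrozenTextEncoder`), the stand-in encoder, the prompt layout, and all dimensions are illustrative assumptions, not the paper's implementation (which builds on CLIP).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrozenTextEncoder(nn.Module):
    """Stand-in for a pretrained CLIP-style text transformer (kept frozen)."""

    def __init__(self, embed_dim: int):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=4,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [B, L, D]
        # Pool token states into one prompt feature per composition.
        return self.transformer(x).mean(dim=1)           # [B, D]


class CSPTextSide(nn.Module):
    """Compositional soft prompting: learnable attribute/object vocabulary."""

    def __init__(self, encoder: nn.Module, prefix_embeds: torch.Tensor,
                 num_attrs: int, num_objs: int, embed_dim: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # the pretrained encoder stays frozen
            p.requires_grad_(False)
        # Fixed prefix context, e.g. the embeddings of "a photo of" ([P, D]).
        self.register_buffer("prefix", prefix_embeds)
        # The only trainable parameters: one soft token per attribute and object.
        self.attr_vocab = nn.Parameter(torch.randn(num_attrs, embed_dim) * 0.02)
        self.obj_vocab = nn.Parameter(torch.randn(num_objs, embed_dim) * 0.02)

    def forward(self, attr_ids: torch.Tensor, obj_ids: torch.Tensor):
        # Compose prompts "[prefix] [attribute] [object]" for each class pair.
        b = attr_ids.shape[0]
        prefix = self.prefix.unsqueeze(0).expand(b, -1, -1)  # [B, P, D]
        attr = self.attr_vocab[attr_ids].unsqueeze(1)        # [B, 1, D]
        obj = self.obj_vocab[obj_ids].unsqueeze(1)           # [B, 1, D]
        prompts = torch.cat([prefix, attr, obj], dim=1)      # [B, P+2, D]
        return F.normalize(self.encoder(prompts), dim=-1)    # unit-norm features


# Usage sketch (all tensors random here): score frozen image features against
# text features for recomposed attribute-object pairs, seen or unseen.
embed_dim = 512
enc = FrozenTextEncoder(embed_dim)
prefix = torch.randn(3, embed_dim)  # stands in for the "a photo of" embeddings
model = CSPTextSide(enc, prefix, num_attrs=100, num_objs=200, embed_dim=embed_dim)

attr_ids = torch.tensor([0, 1])
obj_ids = torch.tensor([5, 7])
text_feats = model(attr_ids, obj_ids)                        # [2, 512]
img_feats = F.normalize(torch.randn(4, embed_dim), dim=-1)   # frozen image tower
logits = img_feats @ text_feats.t()                          # cosine similarities
```

Training would apply a cross-entropy loss over the seen compositions while optimizing only `model.attr_vocab` and `model.obj_vocab`; at inference the same learned vocabulary is simply indexed with unseen attribute-object pairs, which is what distinguishes this from prefix-tuning methods such as CoOp.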