This work explores the zero-shot compositional learning ability of large pre-trained vision-language models (VLMs) within the prompt-based learning framework and proposes a model (\textit{PromptCompVL}) to solve the compositional zero-shot learning (CZSL) problem. \textit{PromptCompVL} makes two design choices: first, it uses soft prompting instead of hard prompting to inject learnable parameters that reprogram VLMs for compositional learning; second, to address the compositional challenge, it uses a soft-embedding layer to learn primitive concepts across different combinations. By combining soft embedding and soft prompting, \textit{PromptCompVL} achieves state-of-the-art performance on the MIT-States dataset. Furthermore, it achieves consistent improvements over other CLIP-based methods, which demonstrates the effectiveness of the proposed prompting strategies for CZSL.
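To make the soft-prompting idea concrete, the following is a minimal stdlib-only Python sketch of how learnable soft-prompt vectors can replace a hard text template and be combined with primitive (attribute, object) embeddings. All names, dimensions, and values here are illustrative assumptions, not the paper's actual implementation:

```python
# Toy sketch of soft prompting for compositional zero-shot learning.
# Assumptions: tiny dimensions, random init, frozen primitive embeddings.
import random

DIM = 4          # toy embedding dimension (real VLMs use hundreds)
PROMPT_LEN = 3   # number of learnable soft-prompt vectors

random.seed(0)

def rand_vec(dim):
    return [random.uniform(-0.1, 0.1) for _ in range(dim)]

# Frozen VLM token embeddings for primitive concepts (hypothetical values).
attribute_emb = {"wet": rand_vec(DIM), "dry": rand_vec(DIM)}
object_emb = {"dog": rand_vec(DIM), "road": rand_vec(DIM)}

# Soft prompt: learnable continuous vectors standing in for a hard
# template such as "a photo of a [attribute] [object]".
soft_prompt = [rand_vec(DIM) for _ in range(PROMPT_LEN)]

def compose(attr, obj):
    """Build the prompt sequence for one (attribute, object) composition:
    [p_1, ..., p_k, e_attr, e_obj]. Because attribute and object
    embeddings are learned separately, unseen pairs can be assembled
    at test time by recombining them."""
    return soft_prompt + [attribute_emb[attr], object_emb[obj]]

seq = compose("wet", "dog")
print(len(seq))  # → 5 (PROMPT_LEN soft tokens + attribute + object)
```

In training, only `soft_prompt` (and, in the paper's second design choice, the primitive embeddings themselves) would receive gradients, while the VLM backbone stays frozen.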