Recent compositional zero-shot learning (CZSL) methods adapt pre-trained vision-language models (VLMs) by constructing trainable prompts only for composed state-object pairs. Because they rely on learning the joint representation of seen compositions, these methods ignore the explicit modeling of the state and object, limiting both the exploitation of pre-trained knowledge and generalization to unseen compositions. With a particular focus on the universality of the solution, in this work we propose a novel paradigm for CZSL models that establishes three identification branches (i.e., Multi-Path) to jointly model the state, object, and composition. Troika, our implementation of this paradigm, aligns the branch-specific prompt representations with decomposed visual features. To calibrate the bias between semantically similar multi-modal representations, we further incorporate a Cross-Modal Traction module into Troika that shifts the prompt representation towards the current visual content. We conduct extensive experiments on three popular benchmarks, where our method significantly outperforms existing methods in both closed-world and open-world settings.
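The Multi-Path idea of fusing state, object, and composition branches can be sketched as follows. This is a minimal illustrative toy, not the paper's actual implementation: the random vectors stand in for CLIP-style text-encoder outputs of the three branch-specific prompts and for the decomposed visual features, and the simple additive fusion of branch logits is an assumption for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize embeddings to unit length, as CLIP-style models do."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Hypothetical label space: 2 states x 2 objects = 4 compositions.
rng = np.random.default_rng(0)
d = 8  # toy embedding dimension
states = ["wet", "dry"]
objects = ["cat", "dog"]
pairs = [(s, o) for s in states for o in objects]

# Stand-ins for text-encoder outputs of the three branch-specific prompts.
state_emb = l2_normalize(rng.normal(size=(len(states), d)))
object_emb = l2_normalize(rng.normal(size=(len(objects), d)))
pair_emb = l2_normalize(rng.normal(size=(len(pairs), d)))

# Stand-in for the visual feature of one image.
img = l2_normalize(rng.normal(size=d))

def multi_path_scores(img, state_emb, object_emb, pair_emb, pairs):
    """Score every composition by combining the three branch predictions."""
    s_logits = state_emb @ img   # state branch: image vs. state prompts
    o_logits = object_emb @ img  # object branch: image vs. object prompts
    c_logits = pair_emb @ img    # composition branch: image vs. pair prompts
    scores = {}
    for k, (s, o) in enumerate(pairs):
        i, j = states.index(s), objects.index(o)
        # Additive fusion of the three branches (illustrative choice).
        scores[(s, o)] = c_logits[k] + s_logits[i] + o_logits[j]
    return scores

scores = multi_path_scores(img, state_emb, object_emb, pair_emb, pairs)
pred = max(scores, key=scores.get)  # predicted (state, object) pair
```

Because the state and object branches score every state and every object independently, they can contribute evidence even for state-object pairs that were never seen together during training, which is the motivation for modeling them explicitly rather than only through composed prompts.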