Compositionality, the ability to combine existing concepts and generalize to novel compositions, is a key capability for intelligent entities. Here, we study the problem of Compositional Zero-Shot Learning (CZSL), which aims to recognize novel attribute-object compositions. Recent approaches build their systems on top of large-scale Vision-Language Pre-trained (VLP) models, e.g., CLIP, and observe significant improvements. However, these methods treat CLIP as a black box and focus on pre- and post-CLIP operations. Instead, we propose to dive deep into the architecture and insert adapters, a parameter-efficient technique proven effective for large language models, into each CLIP encoder layer. We further equip the adapters with concept awareness so that concept-specific features of "object", "attribute", and "composition" can be extracted. We name our method CAILA, Concept-Aware Intra-Layer Adapters. Quantitative evaluations on three popular CZSL datasets, MIT-States, C-GQA, and UT-Zappos, show that CAILA achieves double-digit relative improvements over the current state of the art on all benchmarks.
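To illustrate the general idea of concept-aware intra-layer adapters, the sketch below shows a minimal PyTorch bottleneck adapter with one branch per concept. This is a hedged illustration, not the paper's exact design: the class names, bottleneck width, activation choice, and per-concept routing scheme are assumptions introduced here for clarity.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Standard bottleneck adapter: down-project, non-linearity,
    up-project, then add a residual connection. Only these small
    layers are trained; the host encoder stays frozen."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))


class ConceptAwareAdapters(nn.Module):
    """Illustrative concept-aware variant: one adapter per concept,
    selected by name, so each branch can learn concept-specific
    features. The routing-by-string interface is a simplification."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.adapters = nn.ModuleDict({
            c: Adapter(dim, bottleneck)
            for c in ("object", "attribute", "composition")
        })

    def forward(self, x: torch.Tensor, concept: str) -> torch.Tensor:
        return self.adapters[concept](x)


if __name__ == "__main__":
    # Toy usage: a batch of token features from a hypothetical
    # frozen encoder layer, adapted along the "attribute" branch.
    feats = torch.randn(2, 16, 512)  # (batch, tokens, dim)
    adapters = ConceptAwareAdapters(dim=512)
    out = adapters(feats, concept="attribute")
    print(out.shape)  # torch.Size([2, 16, 512])
```

In such a setup, one module of this kind would be attached inside each encoder layer while the pre-trained CLIP weights remain frozen, so only the lightweight adapter parameters are updated during training.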