Compositional Zero-Shot Learning (CZSL) aims to recognize unseen compositions formed from states and objects that were seen during training. Because the same state can vary widely in visual appearance when entangled with different objects, CZSL remains a challenging task. Some methods recognize state and object with two separately trained classifiers, ignoring the interaction between object and state; others try to learn a joint representation of state-object compositions, which leads to a domain gap between the seen and unseen composition sets. In this paper, we propose a novel Siamese Contrastive Embedding Network (SCEN) (code: https://github.com/XDUxyLi/SCEN-master) for unseen composition recognition. Considering the entanglement between state and object, we embed the visual feature into a Siamese Contrastive Space to capture the prototypes of state and object separately, alleviating the interaction between them. In addition, we design a State Transition Module (STM) to increase the diversity of training compositions, improving the robustness of the recognition model. Extensive experiments indicate that our method significantly outperforms state-of-the-art approaches on three challenging benchmark datasets, including the recently proposed C-GQA dataset.
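To make the idea of separate state and object embedding spaces concrete, the following is a minimal sketch, not the SCEN implementation. It assumes an InfoNCE-style contrastive loss and two fixed linear projections; the names `info_nce`, `project`, `W_state`, `W_object`, and all numeric values are hypothetical and chosen only for illustration. The State Transition Module is not covered here.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def project(matrix, feature):
    """Apply a linear projection (list of rows) to a feature vector."""
    return [dot(row, feature) for row in matrix]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumption, not taken from the paper):
    pulls the anchor toward the positive prototype and pushes it
    away from the negative prototypes, using cosine similarity."""
    a = normalize(anchor)
    logits = [dot(a, normalize(positive)) / temperature]
    logits += [dot(a, normalize(n)) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


# Hypothetical visual feature of one training image.
feature = [0.9, 0.1, 0.4]

# Hypothetical projections into a state space and an object space.
W_state = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
W_object = [[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

# Hypothetical prototypes: one positive and some negatives per space.
state_positive, state_negatives = [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.2]]
object_positive, object_negatives = [1.0, 0.3], [[-0.5, 1.0]]

# The state and object branches are trained with separate contrastive terms.
state_loss = info_nce(project(W_state, feature), state_positive, state_negatives)
object_loss = info_nce(project(W_object, feature), object_positive, object_negatives)
total_loss = state_loss + object_loss
print(f"state={state_loss:.4f} object={object_loss:.4f} total={total_loss:.4f}")
```

In this sketch each branch contrasts the same visual feature against prototypes of only one primitive (state or object), so the two embeddings are learned apart instead of through a single joint composition representation. How SCEN builds its contrastive sets and combines the two branches at inference is specified in the paper and the linked repository, not here.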