In compositional zero-shot learning, the goal is to recognize unseen compositions (e.g., old dog) of observed visual primitives, i.e., states (e.g., old, cute) and objects (e.g., car, dog) present in the training set. This is challenging because the same state can, for example, alter the visual appearance of a dog drastically differently from that of a car. As a solution, we propose a novel graph formulation called Compositional Graph Embedding (CGE) that learns image features, compositional classifiers, and latent representations of visual primitives in an end-to-end manner. The key to our approach is exploiting the dependency between states, objects, and their compositions within a graph structure to enforce the relevant knowledge transfer from seen to unseen compositions. By learning a joint compatibility that encodes semantics between concepts, our model allows for generalization to unseen compositions without relying on an external knowledge base like WordNet. We show that in the challenging generalized compositional zero-shot setting, our CGE significantly outperforms the state of the art on MIT-States and UT-Zappos. We also propose a new benchmark for this task based on the recent GQA dataset. Code is available at: https://github.com/ExplainableML/czsl
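To illustrate the graph idea described above, here is a minimal sketch of how composition embeddings can be computed by propagating state and object features over a graph. This is not the authors' implementation: the toy vocabulary, feature dimensions, and single graph-convolution layer are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy vocabulary: 2 states, 2 objects, and their 4 compositions.
states = ["old", "cute"]
objects = ["car", "dog"]
pairs = [(s, o) for s in states for o in objects]

nodes = states + objects + [f"{s} {o}" for s, o in pairs]
n = len(nodes)
idx = {name: i for i, name in enumerate(nodes)}

# Adjacency: each composition node connects to its constituent
# state node and object node; self-loops keep each node's own features.
A = np.eye(n)
for s, o in pairs:
    c = idx[f"{s} {o}"]
    for p in (idx[s], idx[o]):
        A[c, p] = A[p, c] = 1.0

# Symmetric normalization: A_hat = D^{-1/2} A D^{-1/2}.
d = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(d, d))

# One graph-convolution layer: node features (e.g. word vectors) are
# mixed along edges, so each composition embedding aggregates
# information from its state and its object.
rng = np.random.default_rng(0)
X = rng.standard_normal((n, 8))      # input node features (assumed dim 8)
W = rng.standard_normal((8, 8))      # learnable weight matrix
H = np.maximum(A_hat @ X @ W, 0.0)   # ReLU(A_hat X W)

# The rows of H for composition nodes can then serve as classifier
# weights: an image feature is scored by its dot product with each
# composition embedding, including those of unseen compositions.
print(H.shape)  # → (8, 8): 2 states + 2 objects + 4 compositions
```

Because unseen compositions share graph edges with states and objects observed during training, their embeddings receive gradient signal indirectly, which is what enables the transfer from seen to unseen compositions.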