In compositional zero-shot learning, the goal is to recognize unseen compositions (e.g. old dog) of visual primitives observed in the training set: states (e.g. old, cute) and objects (e.g. car, dog). This is challenging because the same state can, for example, alter the visual appearance of a dog very differently from that of a car. As a solution, we propose a novel graph formulation called Compositional Graph Embedding (CGE) that learns image features, compositional classifiers, and latent representations of visual primitives in an end-to-end manner. The key to our approach is exploiting the dependency between states, objects, and their compositions within a graph structure to enforce relevant knowledge transfer from seen to unseen compositions. By learning a joint compatibility that encodes the semantics between concepts, our model allows for generalization to unseen compositions without relying on an external knowledge base like WordNet. We show that in the challenging generalized compositional zero-shot setting our CGE significantly outperforms the state of the art on MIT-States and UT-Zappos. We also propose a new benchmark for this task based on the recent GQA dataset.
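To make the graph formulation concrete, the following is a minimal PyTorch sketch of the core idea: states, objects, and compositions are nodes of a shared graph, a graph convolution propagates information between primitives and their compositions, and the resulting composition embeddings act as classifiers scored against image features via a dot-product compatibility. The node initialization, dimensions, and the two-layer GCN are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of a compositional graph embedding (CGE-style) model,
# assuming a symmetrically normalized GCN over a graph whose nodes are
# [states | objects | compositions]. Hypothetical names and dimensions.
import torch
import torch.nn as nn


def build_adjacency(num_states, num_objects, compositions):
    """Adjacency with self-loops; each composition (s, o) links to state s and object o."""
    n = num_states + num_objects + len(compositions)
    adj = torch.eye(n)
    for k, (s, o) in enumerate(compositions):
        c = num_states + num_objects + k
        adj[c, s] = adj[s, c] = 1.0
        adj[c, num_states + o] = adj[num_states + o, c] = 1.0
    # Symmetric normalization D^{-1/2} A D^{-1/2}.
    d_inv_sqrt = adj.sum(1).pow(-0.5)
    return d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]


class CGESketch(nn.Module):
    def __init__(self, node_init, adj, feat_dim=512, hidden=1024):
        super().__init__()
        self.register_buffer("adj", adj)          # fixed, normalized graph
        self.node_init = nn.Parameter(node_init)  # e.g. word embeddings of primitives/compositions
        self.gcn = nn.ModuleList([
            nn.Linear(node_init.size(1), hidden),
            nn.Linear(hidden, feat_dim),
        ])

    def forward(self, image_features, composition_ids):
        # Propagate node representations through the graph.
        h = self.node_init
        for i, layer in enumerate(self.gcn):
            h = self.adj @ layer(h)
            if i < len(self.gcn) - 1:
                h = torch.relu(h)
        # Composition node embeddings serve as classifier weights:
        # compatibility = dot product with the image feature.
        classifiers = h[composition_ids]           # (C, feat_dim)
        return image_features @ classifiers.t()    # (B, C) compatibility scores
```

In such a setup, training would use a standard cross-entropy loss over the scores of seen compositions (with the image backbone trained jointly, end-to-end); at test time the same graph already contains nodes for unseen compositions, so their propagated embeddings directly provide zero-shot classifiers.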