Generalization to out-of-distribution data has been a problem for Visual Question Answering (VQA) models. To measure generalization to novel questions, we propose to separate them into "skills" and "concepts". "Skills" are visual tasks, such as counting or attribute recognition, and are applied to "concepts" mentioned in the question, such as objects and people. VQA methods should be able to compose skills and concepts in novel ways, regardless of whether the specific composition has been seen in training, yet we demonstrate that existing models leave substantial room for improvement in handling new compositions. We present a novel method for learning to compose skills and concepts that separates these two factors implicitly within a model by learning grounded concept representations and disentangling the encoding of skills from that of concepts. We enforce these properties with a novel contrastive learning procedure that does not rely on external annotations and can be learned from unlabeled image-question pairs. Experiments demonstrate the effectiveness of our approach for improving compositional and grounding performance.
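As a rough illustration of the kind of annotation-free contrastive objective the abstract alludes to, the sketch below shows a generic InfoNCE-style loss over unlabeled image-question pairs, where a question's pooled concept representation is pulled toward features of its paired image and pushed away from other images in the batch. This is a minimal sketch under assumed inputs, not the paper's actual procedure; the function name `contrastive_loss`, the pooled embeddings, and the temperature value are hypothetical.

```python
# Hypothetical sketch: symmetric InfoNCE over image-question pairs.
# All names and shapes are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F


def contrastive_loss(question_concept_emb, image_region_emb, temperature=0.07):
    """question_concept_emb: (B, D) pooled concept representation per question
    image_region_emb:        (B, D) pooled visual features of the paired image
    """
    q = F.normalize(question_concept_emb, dim=-1)
    v = F.normalize(image_region_emb, dim=-1)

    # Cosine similarity between every question and every image in the batch.
    logits = q @ v.t() / temperature              # (B, B)
    targets = torch.arange(q.size(0), device=q.device)

    # Matched (question, image) pairs are positives; all other pairs in the
    # batch act as negatives, so no external annotations are required.
    loss_q2v = F.cross_entropy(logits, targets)
    loss_v2q = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_q2v + loss_v2q)
```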