Reasoning about visual relationships is central to how humans interpret the visual world. This task remains challenging for current deep learning algorithms since it requires addressing three key technical problems jointly: 1) identifying object entities and their properties, 2) inferring semantic relations between pairs of entities, and 3) generalizing to novel object-relation combinations, i.e., systematic generalization. In this work, we use vision transformers (ViTs) as our base model for visual reasoning and make better use of concepts, defined as object entities and their relations, to improve the reasoning ability of ViTs. Specifically, we introduce a novel concept-feature dictionary to allow flexible image feature retrieval at training time with concept keys. This dictionary enables two new concept-guided auxiliary tasks: 1) a global task for promoting relational reasoning, and 2) a local task for facilitating semantic object-centric correspondence learning. To examine the systematic generalization of visual reasoning models, we introduce systematic splits for the standard HICO and GQA benchmarks. We show that the resulting model, Concept-guided Vision Transformer (or RelViT for short), significantly outperforms prior approaches on HICO and GQA by 16% and 13% in the original split, and by 43% and 18% in the systematic split. Our ablation analyses also reveal our model's compatibility with multiple ViT variants and robustness to hyper-parameters.
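To make the idea of retrieving image features by concept key more concrete, here is a minimal sketch. It is not the authors' implementation: the class name `ConceptFeatureDictionary`, the bounded first-in-first-out queue per concept, the `queue_size` parameter, and the uniform sampling in `retrieve` are all assumptions made for illustration, and features are plain Python lists rather than ViT embeddings.

```python
import random
from collections import defaultdict, deque


class ConceptFeatureDictionary:
    """Hypothetical concept-keyed store of image features (illustrative only)."""

    def __init__(self, queue_size=4, seed=0):
        # Assumption: one bounded FIFO queue per concept key,
        # so the oldest stored features are evicted first.
        self._queues = defaultdict(lambda: deque(maxlen=queue_size))
        self._rng = random.Random(seed)

    def update(self, concepts, feature):
        """Store one image feature under every concept attached to that image."""
        for concept in concepts:
            self._queues[concept].append(feature)

    def retrieve(self, concept):
        """Return one stored feature for `concept`, or None if the key is unseen."""
        queue = self._queues.get(concept)  # .get() does not create a new key
        if not queue:
            return None
        # Assumption: uniform sampling among the stored features.
        return self._rng.choice(list(queue))
```

During training, a model could call `update` with the concepts annotated on each image (for example `["person", "ride", "horse"]`) and later call `retrieve("ride")` to fetch a feature from another image sharing that concept, which is the kind of lookup the auxiliary tasks would need.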