We study the problem of concept induction in visual reasoning, i.e., identifying concepts and their hierarchical relationships from question-answer pairs associated with images, and achieve an interpretable model by working in the induced symbolic concept space. To this end, we first design a new framework named the object-centric compositional attention model (OCCAM) to perform the visual reasoning task with object-level visual features. We then propose a method to induce concepts of objects and relations using clues from the attention patterns between objects' visual features and question words. Finally, we achieve a higher level of interpretability by applying OCCAM to objects represented in the induced symbolic concept space. Our model design makes this adaptation straightforward: we first predict the concepts of objects and relations and then project the predicted concepts back to the visual feature space so that the compositional reasoning module can operate as usual. Experiments on the CLEVR and GQA datasets demonstrate that 1) OCCAM achieves a new state of the art without human-annotated functional programs, and 2) the induced concepts are both accurate and sufficient, as OCCAM achieves on-par performance whether objects are represented by visual features or in the induced symbolic concept space.