Visual question answering requires high-order reasoning about an image, which is a fundamental capability needed by machine systems to follow complex directives. Recently, modular networks have been shown to be an effective framework for performing visual reasoning tasks. While modular networks were initially designed with a degree of model transparency, their performance on complex visual reasoning benchmarks was lacking. Current state-of-the-art approaches do not provide an effective mechanism for understanding the reasoning process. In this paper, we close the performance gap between interpretable models and state-of-the-art visual reasoning methods. We propose a set of visual-reasoning primitives which, when composed, manifest as a model capable of performing complex reasoning tasks in an explicitly-interpretable manner. The fidelity and interpretability of the primitives' outputs enable an unparalleled ability to diagnose the strengths and weaknesses of the resulting model. Critically, we show that these primitives are highly performant, achieving state-of-the-art accuracy of 99.1% on the CLEVR dataset. We also show that our model is able to effectively learn generalized representations when provided a small amount of data containing novel object attributes. Using the CoGenT generalization task, we show more than a 20 percentage point improvement over the current state of the art.
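As an illustration only (not the paper's implementation), the sketch below shows how attention-style reasoning primitives might be composed into an explicitly inspectable chain for a CLEVR-like query. The module names, channel sizes, and the two-step "program" are assumptions made for the example.

```python
# Hypothetical sketch of composable visual-reasoning primitives.
# All architectural details here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPrimitive(nn.Module):
    """Produces a 1-channel spatial attention mask from image features."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.conv1 = nn.Conv2d(feat_dim, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 1, kernel_size=1)

    def forward(self, feats, prev_attn):
        # Gate the image features with the attention from the previous
        # primitive, then predict a refined attention mask.
        x = feats * prev_attn
        x = F.relu(self.conv1(x))
        return torch.sigmoid(self.conv2(x))

class AnswerPrimitive(nn.Module):
    """Maps attended features to a distribution over candidate answers."""
    def __init__(self, feat_dim=128, num_answers=28):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_answers)

    def forward(self, feats, attn):
        # Pool the features under the final attention map and classify.
        pooled = (feats * attn).sum(dim=(2, 3)) / attn.sum(dim=(2, 3)).clamp(min=1e-6)
        return self.fc(pooled)

# Compose primitives according to a hypothetical two-step program,
# e.g. "find spheres" -> "filter large" -> "answer".
feats = torch.randn(1, 128, 14, 14)   # image feature map
attn = torch.ones(1, 1, 14, 14)       # initial uniform attention
program = [AttentionPrimitive(), AttentionPrimitive()]
for module in program:
    attn = module(feats, attn)        # each intermediate mask is inspectable
logits = AnswerPrimitive()(feats, attn)
print(logits.shape)                   # torch.Size([1, 28])
```

Because every primitive emits a spatial attention map, each intermediate step of the reasoning chain can be visualized and diagnosed directly, which is the property the abstract highlights.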