Inferring and Executing Programs for Visual Reasoning proposes a model for visual reasoning that consists of a program generator and an execution engine, rather than a single end-to-end network. To show that the model actually learns which objects to focus on when answering questions, the authors visualize the norm of the gradient of the sum of the predicted answer scores with respect to the final feature map. However, they do not quantitatively evaluate the quality of these focus maps. This paper proposes a method for evaluating them. We generate several kinds of questions to test different keywords, infer focus maps from the model by asking these questions, and evaluate them by comparing them against the ground-truth segmentation. Furthermore, this method can be applied to any model from which focus maps can be inferred. By evaluating the focus maps of different models on the CLEVR dataset, we show that the CLEVR-iep model has learned where to focus better than end-to-end models.
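The focus map described above can be sketched as follows. This is a minimal illustration, not the actual CLEVR-iep implementation: `TinyModel` is a hypothetical stand-in for any network that exposes a final convolutional feature map, and the focus map is the channel-wise norm of the gradient of the summed answer scores with respect to that feature map.

```python
import torch
import torch.nn as nn

class TinyModel(nn.Module):
    """Hypothetical stand-in model; any network exposing its final
    feature map can be used the same way."""
    def __init__(self, num_answers=4):
        super().__init__()
        self.conv = nn.Conv2d(3, 8, 3, padding=1)  # produces the "final feature map"
        self.head = nn.Linear(8, num_answers)

    def forward(self, image):
        feats = self.conv(image)          # (B, C, H, W)
        feats.retain_grad()               # keep the gradient w.r.t. this non-leaf tensor
        pooled = feats.mean(dim=(2, 3))   # global average pool
        return self.head(pooled), feats

def focus_map(model, image):
    """Norm of the gradient of the sum of predicted answer scores
    with respect to the final feature map."""
    scores, feats = model(image)
    scores.sum().backward()               # sum of predicted answer scores
    return feats.grad.norm(dim=1)         # L2 norm over channels -> (B, H, W)

fmap = focus_map(TinyModel(), torch.randn(1, 3, 16, 16))
print(fmap.shape)  # torch.Size([1, 16, 16])
```

The resulting per-location saliency values can then be compared against a segmentation mask, for example by measuring how much of the focus map's mass falls inside the relevant object's region.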