Visual Question Answering (VQA) is a popular task that combines vision and language, with numerous implementations in the literature. Although some attempts address explainability and robustness issues in VQA models, very few employ counterfactuals as a means of probing such challenges in a model-agnostic way. In this work, we propose a systematic method for explaining the behavior and investigating the robustness of VQA models through counterfactual perturbations. To this end, we exploit structured knowledge bases to perform deterministic, optimal, and controllable word-level replacements targeting the linguistic modality, and we then evaluate the model's responses to such counterfactual inputs. Finally, we qualitatively extract local and global explanations based on counterfactual responses, which ultimately prove insightful for interpreting VQA model behavior. By performing a variety of perturbation types targeting different parts of speech of the input question, we gain insights into the model's reasoning by comparing its responses under different adversarial circumstances. Overall, we reveal possible biases in the model's decision-making process, as well as expected and unexpected patterns that impact its performance both quantitatively and qualitatively, as indicated by our analysis.
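To make the idea of knowledge-based, word-level counterfactual perturbation concrete, the following is a minimal illustrative sketch, assuming WordNet (via NLTK) as the structured knowledge base and adjective-antonym substitution as the replacement policy; the function name and selection strategy are hypothetical and do not reproduce the exact procedure described above.

```python
# Hypothetical sketch: generate counterfactual variants of a VQA question by
# replacing adjectives with WordNet antonyms (e.g. "large" -> "small").
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def antonym_replacements(question: str):
    """Yield counterfactual variants of `question`, each with one adjective
    swapped for a WordNet antonym of one of its senses."""
    tokens = question.split()
    for i, tok in enumerate(tokens):
        for syn in wn.synsets(tok, pos=wn.ADJ):
            for lemma in syn.lemmas():
                for ant in lemma.antonyms():
                    variant = tokens.copy()
                    variant[i] = ant.name().replace("_", " ")
                    yield " ".join(variant)

# Usage: feed each perturbed question to the VQA model and compare its answer
# with the answer to the original question.
for cf in antonym_replacements("What color is the large dog?"):
    print(cf)  # e.g. "What color is the small dog?"
```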