Deep neural networks have been central to Visual Question Answering (VQA), and research has traditionally focused on improving model accuracy. Recently, however, attention has shifted towards evaluating the robustness of these models against adversarial attacks. Such an evaluation measures the accuracy of a VQA model under increasing levels of noise in its input, where the noise can target either the image or the query question, referred to as the main question. A proper analysis of this aspect of VQA is currently lacking. This work proposes a new method that uses semantically related questions, referred to as basic questions, as noise to evaluate the robustness of VQA models. The underlying hypothesis is that the less similar a basic question is to the main question, the more noise it introduces. To generate a reasonable noise level for a given main question, a pool of basic questions is ranked by similarity to the main question, and this ranking problem is cast as a LASSO optimization problem. This work also proposes a novel robustness measure, R_score, and two basic question datasets to standardize the analysis of VQA model robustness. The experimental results demonstrate that the proposed evaluation method effectively analyzes the robustness of VQA models. Moreover, the experiments show that in-context learning with a chain of basic questions can enhance model accuracy.
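The LASSO-based ranking step can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so the code below assumes one common form: the main question embedding b is approximated as a sparse combination of basic question embeddings (the columns of A) by minimising 0.5*||Ax - b||^2 + lam*||x||_1, and basic questions are ranked by their weights in x. The toy three-dimensional embeddings, the question names, the value of lam, and all function names are hypothetical and chosen only for illustration; a small coordinate-descent solver stands in for whatever solver the authors used.

```python
def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold (the proximal step for the L1 penalty)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(columns, target, lam, n_iters=200):
    """Minimise 0.5*||A x - b||^2 + lam*||x||_1 by cyclic coordinate descent.

    columns: list of column vectors of A (one per basic question).
    target:  the vector b (the main question embedding).
    """
    x = [0.0] * len(columns)
    residual = list(target)  # equals b - A x, and x starts at zero
    for _ in range(n_iters):
        for j, col in enumerate(columns):
            norm_sq = sum(c * c for c in col)
            if norm_sq == 0.0:
                continue
            # Correlation of column j with the residual, with j's own
            # current contribution added back in.
            rho = sum(c * (r + c * x[j]) for c, r in zip(col, residual))
            new_xj = soft_threshold(rho, lam) / norm_sq
            delta = new_xj - x[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                x[j] = new_xj
    return x


def rank_basic_questions(basic_embeddings, main_embedding, lam=0.05):
    """Return (name, weight) pairs sorted from most to least similar."""
    names = list(basic_embeddings)
    columns = [basic_embeddings[name] for name in names]
    weights = lasso_coordinate_descent(columns, main_embedding, lam)
    return sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)


# Hypothetical toy embeddings; real sentence embeddings would be far larger.
main_question = [1.0, 0.5, 0.0]
basic_questions = {
    "unrelated": [0.0, 0.0, 1.0],
    "partly_related": [0.0, 1.0, 0.0],
    "closely_related": [1.0, 0.4, 0.0],
}

ranking = rank_basic_questions(basic_questions, main_question)
for name, weight in ranking:
    print(f"{name}: {weight:.3f}")
```

Under this assumed formulation, a basic question with a larger weight is treated as more similar to the main question, so questions further down the ranking would correspond to higher noise levels. The L1 penalty drives the weights of unrelated questions to exactly zero, which is what makes the ranking sparse.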