Visual Question Answering (VQA) models should be both robust and accurate. Unfortunately, most current VQA research focuses only on accuracy, because proper methods for measuring the robustness of VQA models are lacking. Our algorithm has two main modules. Given a natural language question about an image, the first module takes the question as input and outputs a ranked list of basic questions, with similarity scores, for the given main question. The second module takes the main question, the image, and these basic questions as input, and outputs a text-based answer to the main question about the given image. We argue that a robust VQA model is one whose performance does not change much when related basic questions are also made available to it as input. We formulate basic question generation as a LASSO optimization problem, and we propose a large-scale Basic Question Dataset (BQD) and Rscore, a novel robustness measure, for analyzing the robustness of VQA models. We hope BQD will serve as a benchmark for evaluating the robustness of VQA models and help the community build more robust and accurate VQA models.
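The first module's LASSO formulation can be illustrated with a small sketch. This is not the paper's implementation; it assumes the main question and the candidate basic questions have already been mapped to fixed-length embedding vectors (here stand-in random vectors), and solves the standard LASSO objective with a simple ISTA (proximal gradient) loop to obtain sparse similarity weights, which then induce the ranking:

```python
import numpy as np

def soft_threshold(x, thresh):
    """Elementwise soft-thresholding, the proximal operator of the l1 penalty."""
    return np.sign(x) * np.maximum(np.abs(x) - thresh, 0.0)

def lasso_rank(main_vec, basic_vecs, alpha=0.01, n_iter=500):
    """Rank basic questions by sparse coding weights (ISTA solver).

    Solves min_x 0.5*||A x - b||^2 + alpha*||x||_1, where b is the
    main-question embedding and the columns of A hold the basic-question
    embeddings; a larger weight x_i marks basic question i as more
    similar to the main question. (Illustrative sketch, not the paper's
    exact solver.)
    """
    A = np.asarray(basic_vecs, dtype=float).T      # (dim, n_basic)
    b = np.asarray(main_vec, dtype=float)          # (dim,)
    step = 1.0 / np.linalg.norm(A, 2) ** 2         # 1/L, L = Lipschitz const of grad
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                   # gradient of the quadratic term
        x = soft_threshold(x - step * grad, step * alpha)
    order = np.argsort(-x)                         # descending similarity score
    return [(int(i), float(x[i])) for i in order]

# Toy demo: random vectors stand in for real sentence embeddings.
rng = np.random.default_rng(0)
basic = rng.normal(size=(5, 16))                   # 5 candidate basic questions
main = basic[2] + 0.05 * rng.normal(size=16)       # main question close to #2
ranking = lasso_rank(main, basic)
print(ranking[0])  # (index, score) of the top-ranked basic question
```

The l1 penalty drives most weights to exactly zero, so only a few basic questions receive nonzero similarity scores; these ranked questions are what the second module appends to the main question when probing the VQA model's robustness.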