We propose a novel approach to identifying the difficulty of visual questions in Visual Question Answering (VQA) without direct supervision or difficulty annotations. Prior work has considered the diversity of the ground-truth answers given by human annotators. In contrast, we analyze the difficulty of visual questions based on the behavior of multiple different VQA models. We propose to cluster the entropy values of the predicted answer distributions obtained by three different models: a baseline method that takes both images and questions as input, and two variants that take only images and only questions as input, respectively. We use simple k-means clustering to cluster the visual questions of the VQA v2 validation set. We then use state-of-the-art methods to determine the accuracy and the entropy of the answer distributions for each cluster. A benefit of the proposed method is that no annotation of difficulty is required, because the accuracy of each cluster reflects the difficulty of the visual questions that belong to it. Our approach can identify clusters of difficult visual questions that are not answered correctly by state-of-the-art methods. Detailed analysis on the VQA v2 dataset reveals that 1) all methods perform poorly on the most difficult cluster (about 10% accuracy), 2) as the cluster difficulty increases, the answers predicted by the different methods begin to differ, and 3) the cluster entropy values are highly correlated with the cluster accuracy. We show that our approach has the advantage of being able to assess the difficulty of visual questions for which ground-truth answers are unavailable (i.e., the test set of VQA v2) by assigning them to one of the clusters. We expect that this can stimulate the development of novel directions of research and new algorithms.
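The pipeline described above (entropy of each model's predicted answer distribution, then k-means over the resulting three-dimensional features) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the toy distributions over four answers, the choice of k = 2, the use of natural-log entropy, and the deterministic farthest-point initialization are not taken from the paper, and the helper names are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def entropy_features(p_full, p_image_only, p_question_only):
    """One 3-D feature per question: entropy under each of the three models."""
    return [entropy(p_full), entropy(p_image_only), entropy(p_question_only)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the closest cluster center; needs no ground-truth answer."""
    return min(range(len(centers)), key=lambda c: sq_dist(point, centers[c]))

def kmeans(points, k, iters=50):
    """Plain k-means with farthest-point initialization (an assumption,
    chosen here only to keep the toy example deterministic)."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [nearest(p, centers) for p in points]
    for _ in range(iters):
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
        # Assignment step: stop once the labels no longer change.
        new_labels = [nearest(p, centers) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers

# Toy (made-up) answer distributions over 4 candidate answers:
# (image+question model, image-only model, question-only model).
uniform = [0.25, 0.25, 0.25, 0.25]
questions = [
    ([0.97, 0.01, 0.01, 0.01], [0.90, 0.05, 0.03, 0.02], [0.85, 0.05, 0.05, 0.05]),
    ([0.90, 0.05, 0.03, 0.02], [0.80, 0.10, 0.05, 0.05], [0.85, 0.05, 0.05, 0.05]),
    (uniform, uniform, uniform),
    ([0.30, 0.30, 0.20, 0.20], uniform, [0.40, 0.30, 0.20, 0.10]),
]
features = [entropy_features(*q) for q in questions]
labels, centers = kmeans(features, k=2)

# A question without ground truth is assigned to the nearest cluster.
unlabeled = entropy_features(uniform, uniform, uniform)
unlabeled_cluster = nearest(unlabeled, centers)
```

In this toy setting the two confident questions and the two near-uniform questions fall into separate clusters. Given per-question correctness on a validation set, the mean accuracy within each cluster would then serve as that cluster's difficulty estimate, as the abstract describes.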