Deep learning algorithms have shown promising results in visual question answering (VQA) tasks, but a closer look reveals that they often do not understand the rich signals they are fed. To understand and better measure the generalization capabilities of VQA systems, we examine their robustness to counterfactually augmented data. Our proposed augmentations are designed to make a focused intervention on a specific property of the question such that the answer changes. Using these augmentations, we propose a new robustness measure, Robustness to Augmented Data (RAD), which measures the consistency of model predictions between original and augmented examples. Through extensive experimentation, we show that RAD, unlike classical accuracy measures, can quantify when state-of-the-art systems are not robust to counterfactuals. We find substantial failure cases, revealing that current VQA systems are still brittle. Finally, we connect robustness and generalization, demonstrating the predictive power of RAD for performance on unseen augmentations.
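The abstract only states that RAD measures prediction consistency between original and augmented examples; the sketch below assumes one plausible instantiation of such a measure, namely the fraction of originally correct examples whose augmented counterparts are also answered correctly. The function name `rad` and the boolean-list interface are illustrative assumptions, not the paper's implementation.

```python
def rad(orig_correct, aug_correct):
    """Consistency of predictions across (original, augmented) question pairs.

    orig_correct, aug_correct: parallel lists of booleans indicating whether
    the model answered each original question and its augmented counterpart
    correctly. Returns the fraction of originally correct examples that remain
    correct after augmentation (assumed definition; see lead-in above).
    """
    both = sum(o and a for o, a in zip(orig_correct, aug_correct))
    orig = sum(orig_correct)
    return both / orig if orig else 0.0


# Example: the model answers 3 of 4 originals correctly, but only 2 of the
# corresponding augmented questions, so consistency is 2/3.
print(rad([True, True, True, False], [True, False, True, False]))  # 0.666...
```

Under this reading, a model can score high on classical accuracy yet low on RAD whenever its correct answers flip under focused interventions, which is exactly the brittleness the measure is meant to expose.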