Existing Visual Question Answering (VQA) models are often fragile and sensitive to input variations. In this paper, we propose a novel approach based on modular networks to address this issue: it creates pairs of questions related by linguistic perturbations and regularizes the visual reasoning process between them to be consistent during training. We show that our framework markedly improves consistency and generalization, demonstrating the value of controlled linguistic perturbations as a useful and currently underutilized training and regularization tool for VQA models. We also present VQA Perturbed Pairings (VQA P2), a new, low-cost benchmark and augmentation pipeline for creating controllable linguistic variations of VQA questions. Our benchmark uniquely draws on large-scale linguistic resources, avoiding human annotation effort while maintaining data quality relative to generative approaches. We benchmark existing VQA models on VQA P2 and provide a robustness analysis for each type of linguistic variation.
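The consistency regularization described above can be sketched as a joint objective over a question and its perturbed counterpart. The specific loss form below (a symmetric KL divergence between the model's answer distributions, weighted by a hypothetical coefficient `lam`) is an illustrative assumption, not necessarily the paper's exact formulation:

```python
import math

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two discrete answer distributions;
    # eps guards against log(0) for zero-probability answers.
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(p_orig, p_pert):
    # Symmetric KL between the model's answer distributions for the
    # original question and its linguistically perturbed counterpart.
    # Zero when the two distributions agree exactly.
    return 0.5 * (kl_divergence(p_orig, p_pert) + kl_divergence(p_pert, p_orig))

def total_loss(task_loss_orig, task_loss_pert, p_orig, p_pert, lam=1.0):
    # Joint objective (illustrative): answer both questions correctly
    # while keeping their output distributions consistent.
    return task_loss_orig + task_loss_pert + lam * consistency_loss(p_orig, p_pert)
```

In this sketch, identical predictions on the original and perturbed questions incur no consistency penalty, while divergent predictions are penalized even when both answers happen to be correct, pushing the model toward reasoning that is stable under rephrasing.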