Visual question answering (VQA) is challenging not only because the model has to handle multi-modal information, but also because it is just so hard to collect sufficient training examples -- there are too many questions one can ask about an image. As a result, a VQA model trained solely on human-annotated examples could easily over-fit to the specific question styles or image contents that are asked about, leaving the model largely ignorant of the sheer diversity of possible questions. Existing methods address this issue primarily by introducing an auxiliary task such as visual grounding, cycle consistency, or debiasing. In this paper, we take a drastically different approach. We found that many of the "unknowns" to the learned VQA model are indeed "known" implicitly in the dataset. For instance, questions asking about the same object in different images are likely paraphrases; the number of detected or annotated objects in an image already provides the answer to the "how many" question, even if that question has not been annotated for the image. Building upon these insights, we present a simple data augmentation pipeline, SimpleAug, to turn this "known" knowledge into training examples for VQA. We show that these augmented examples can notably improve the learned VQA models' performance, not only on the VQA-CP dataset with language prior shifts but also on the VQA v2 dataset without such shifts. Our method further opens up the door to leveraging weakly-labeled or unlabeled images in a principled way to enhance VQA models. Our code and data are publicly available at https://github.com/heendung/simpleAUG.
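For illustration, below is a minimal Python sketch of the counting-question idea mentioned above: synthesizing "how many" training pairs directly from an image's object annotations (e.g., detector outputs or ground-truth boxes). This is not the paper's actual SimpleAug implementation; the function and variable names are hypothetical, and the naive pluralization is only for demonstration.

```python
from collections import Counter

def augment_counting_questions(object_labels):
    """Generate synthetic (question, answer) pairs for one image from its
    object annotations, so the count information already "known" in the
    dataset becomes explicit VQA training examples.

    object_labels: list of category names annotated/detected in the image,
                   e.g. ["dog", "dog", "frisbee"].
    Returns a list of (question, answer) tuples.
    """
    qa_pairs = []
    for category, count in Counter(object_labels).items():
        # Naive pluralization; a real pipeline would use the dataset's
        # category names and question templates.
        question = f"How many {category}s are there in the image?"
        qa_pairs.append((question, str(count)))
    return qa_pairs

if __name__ == "__main__":
    # Example: an image annotated with two dogs and one frisbee.
    print(augment_counting_questions(["dog", "dog", "frisbee"]))
    # [('How many dogs are there in the image?', '2'),
    #  ('How many frisbees are there in the image?', '1')]
```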