Benefiting from large-scale Pretrained Vision-Language Models (VL-PMs), the performance of Visual Question Answering (VQA) has begun to approach human oracle performance. However, finetuning large-scale VL-PMs with limited data for VQA usually suffers from overfitting and poor generalization, leading to a lack of robustness. In this paper, we aim to improve the robustness of VQA systems (i.e., the ability of the systems to defend against input variations and human-adversarial attacks) from the perspective of the Information Bottleneck when finetuning VL-PMs for VQA. Generally, internal representations obtained by VL-PMs inevitably contain information that is irrelevant and redundant for the downstream VQA task, resulting in statistically spurious correlations and insensitivity to input variations. To encourage representations to converge to a minimal sufficient statistic in vision-language learning, we propose the Correlation Information Bottleneck (CIB) principle, which seeks a tradeoff between representation compression and redundancy by minimizing the mutual information (MI) between the inputs and internal representations while maximizing the MI between the outputs and the representations. Meanwhile, CIB measures the internal correlations among visual and linguistic inputs and representations via a symmetrized joint MI estimation. Extensive experiments on five VQA benchmarks of input robustness and two VQA benchmarks of human-adversarial robustness demonstrate the effectiveness and superiority of the proposed CIB in improving the robustness of VQA systems.
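The information bottleneck tradeoff described above — compressing I(X; Z) while preserving I(Z; Y) — is commonly optimized through a variational upper bound. The sketch below illustrates this generic variational-IB objective in NumPy; it is an assumption for illustration only, not the paper's CIB loss (the function name `vib_loss` and the Gaussian-posterior parameterization are hypothetical), and in particular it does not include CIB's symmetrized joint MI estimation.

```python
import numpy as np

def vib_loss(logits, labels, mu, log_var, beta=1e-3):
    """Variational IB-style objective (illustrative sketch, not CIB itself).

    - Cross-entropy term: a variational lower-bound proxy for maximizing
      the MI between representations Z and outputs Y.
    - KL term: KL( N(mu, diag(exp(log_var))) || N(0, I) ), a variational
      upper bound on the MI between inputs X and representations Z.
    - beta: tradeoff weight between prediction and compression.
    """
    # Numerically stable log-softmax for the cross-entropy term
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(labels)), labels].mean()

    # Closed-form KL of a diagonal Gaussian posterior to a standard normal,
    # summed over latent dimensions and averaged over the batch
    kl = 0.5 * (np.exp(log_var) + mu**2 - 1.0 - log_var).sum(axis=1).mean()
    return ce + beta * kl
```

When the encoder's posterior collapses to the standard-normal prior (mu = 0, log_var = 0), the KL penalty vanishes and the objective reduces to plain cross-entropy; increasing `beta` trades prediction accuracy for stronger compression of the representation.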