Benefiting from large-scale pretrained vision-language models (VLMs), the performance of Visual Question Answering (VQA) has approached human oracle performance. However, finetuning large-scale pretrained VLMs with limited data usually suffers from overfitting and poor generalization, leading to a lack of model robustness. In this paper, we aim to improve input robustness, \ie the ability of models to defend against visual and linguistic input variations as well as shortcut learning induced by the inputs, from the perspective of the Information Bottleneck when adapting pretrained VLMs to the downstream VQA task. In general, internal representations obtained by pretrained VLMs inevitably contain information that is irrelevant and redundant for a specific downstream task, resulting in spurious statistical correlations and insensitivity to input variations. To encourage the obtained representations to converge to a minimal sufficient statistic in vision-language learning, we propose the Correlation Information Bottleneck (CIB) principle, which seeks a tradeoff between representation compression and redundancy by minimizing the mutual information (MI) between the inputs and the internal representations while maximizing the MI between the outputs and the representations. Furthermore, CIB measures the internal correlations among visual and linguistic inputs and representations via a symmetrized joint MI estimation. Extensive experiments on five VQA benchmarks of input robustness demonstrate the effectiveness and superiority of the proposed CIB in terms of both robustness and accuracy.
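The compression-redundancy tradeoff described above can be sketched in the standard Information Bottleneck form; this is the generic formulation only, not the paper's exact CIB objective, which additionally involves the symmetrized joint MI terms:

```latex
\min_{p(t \mid x)} \; \mathcal{L}_{\mathrm{IB}} \;=\; I(X; T) \;-\; \beta \, I(T; Y),
```

where $X$ denotes the inputs, $T$ the internal representation, $Y$ the task output, and $\beta > 0$ balances compression of $T$ against preserving information predictive of $Y$.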