Despite the excellent performance of large-scale vision-language pre-trained models (VLPs) on the conventional visual question answering (VQA) task, they still suffer from two problems: first, VLPs tend to rely on language biases in datasets and fail to generalize to out-of-distribution (OOD) data; second, they are inefficient in terms of memory footprint and computation. Although promising progress has been made on both problems, most existing works tackle them independently. To facilitate the application of VLPs to VQA tasks, it is imperative to jointly study VLP compression and OOD robustness, which, however, has not yet been explored. In this paper, we investigate whether a VLP can be compressed and debiased simultaneously by searching for sparse and robust subnetworks. To this end, we conduct extensive experiments with LXMERT, a representative VLP, on the OOD dataset VQA-CP v2. We systematically study the design of a training and compression pipeline to search for the subnetworks, as well as the assignment of sparsity to different modality-specific modules. Our results show that there indeed exist sparse and robust LXMERT subnetworks, which significantly outperform the full model (without debiasing) with far fewer parameters. These subnetworks also surpass the current state-of-the-art (SoTA) debiasing models with comparable or fewer parameters. We will release the code upon publication.
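To make the subnetwork-search idea concrete, below is a minimal, illustrative sketch of one common way to extract a sparse subnetwork with modality-specific sparsity levels: one-shot magnitude pruning applied per module. This is only an assumption-laden toy example, not the paper's actual pipeline; the module names (lang_encoder, visn_encoder, cross_encoder) are hypothetical stand-ins for LXMERT's language, vision, and cross-modality encoders, and the real search procedure may differ (e.g., learned mask training).

```python
import torch
import torch.nn as nn

# Toy stand-in for a VLP with modality-specific modules.
# Attribute names are illustrative, not LXMERT's real ones.
class ToyVLP(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.lang_encoder = nn.Linear(dim, dim)
        self.visn_encoder = nn.Linear(dim, dim)
        self.cross_encoder = nn.Linear(dim, dim)

def magnitude_mask(weight, sparsity):
    """Binary mask that keeps the largest-magnitude (1 - sparsity) fraction."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

def prune_subnetwork(model, sparsity_per_module):
    """Zero out low-magnitude weights, with a per-module sparsity budget."""
    masks = {}
    for name, module in model.named_children():
        mask = magnitude_mask(module.weight.data,
                              sparsity_per_module.get(name, 0.0))
        module.weight.data.mul_(mask)  # apply the binary mask in place
        masks[name] = mask
    return masks

model = ToyVLP()
# E.g., prune the unimodal encoders more aggressively than the cross-modal one.
masks = prune_subnetwork(model, {"lang_encoder": 0.9,
                                 "visn_encoder": 0.7,
                                 "cross_encoder": 0.5})
for name, m in masks.items():
    print(f"{name}: kept {m.mean().item():.0%} of weights")
```

In this sketch, the per-module sparsity dictionary plays the role of the "assignment of sparsity to different modality-specific modules" studied in the paper; in practice one would sweep such budgets and fine-tune (or train masks for) the resulting subnetwork.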