Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Moreover, increasingly heavy models, e.g., Transformers, have drawn researchers' attention to model compression. However, how to compress multimodal models, especially vision-language Transformers, is still under-explored. This paper proposes \textbf{U}nified and \textbf{P}r\textbf{o}gressive \textbf{P}runing (UPop) as a universal vision-language Transformer compression framework, which incorporates 1) unified search of multimodal subnets in a continuous optimization space starting from the original model, enabling automatic assignment of pruning ratios among compressible modalities and structures; and 2) progressive search and retraining of the subnet, which maintains convergence between search and retraining to attain higher compression ratios. Experiments on multiple generative and discriminative vision-language tasks, including Visual Reasoning, Image Captioning, Visual Question Answering, Image-Text Retrieval, Text-Image Retrieval, and Image Classification, demonstrate the effectiveness and versatility of the proposed UPop framework.
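To make the two ideas concrete, the PyTorch-style sketch below illustrates them under simplifying assumptions; it is not the authors' released implementation, and the names \texttt{MaskedLinear}, \texttt{sparsity\_loss}, \texttt{progressive\_prune}, and \texttt{final\_ratio} are hypothetical. Learnable soft masks over prunable structures form the continuous search space, a single L1 penalty shared by all masks lets the optimizer assign sparsity across modalities and structures automatically, and a schedule that gradually zeroes the smallest mask entries interleaves pruning with retraining.

\begin{verbatim}
# Conceptual sketch only; hypothetical names, not the official UPop code.
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Linear layer whose output neurons are gated by a learnable soft mask,
    giving a continuous search space over prunable structures."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.mask = nn.Parameter(torch.ones(out_features))  # one gate per neuron

    def forward(self, x):
        return self.linear(x) * self.mask  # soft gating during the search

def sparsity_loss(model, weight=1e-4):
    """L1 penalty shared by all masks (vision and language alike), so the
    optimizer distributes sparsity among modalities/structures automatically."""
    return weight * sum(m.mask.abs().sum()
                        for m in model.modules() if isinstance(m, MaskedLinear))

def progressive_prune(model, step, total_steps, final_ratio=0.5):
    """Gradually zero out the currently smallest mask entries, so the retained
    weights keep training while the pruning ratio grows toward its target."""
    current_ratio = final_ratio * min(1.0, step / total_steps)
    for m in model.modules():
        if isinstance(m, MaskedLinear):
            k = int(current_ratio * m.mask.numel())
            if k > 0:
                idx = torch.topk(m.mask.abs(), k, largest=False).indices
                with torch.no_grad():
                    m.mask[idx] = 0.0
\end{verbatim}

In such a sketch, the training loop would add \texttt{sparsity\_loss(model)} to the task loss and call \texttt{progressive\_prune} at each step; once training finishes, structures whose masks are zero can be physically removed to obtain the compressed subnet.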