Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing into a new era. Following this trend, the size of multimodal learning models continues to grow, creating an urgent need to reduce the massive computational cost of finetuning these models for downstream tasks. In this paper, we propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pre-trained transformers. Specifically, we first present a modular multimodal fusion framework that exhibits high flexibility and facilitates mutual interactions among different modalities. In addition, we disentangle vanilla prompts into three types in order to learn different optimization objectives for multimodal learning. Notably, we propose to add prompt vectors only to the deep layers of the unimodal transformers, which significantly reduces training memory usage. Experimental results show that our method achieves performance comparable to several other multimodal finetuning methods with less than 3% trainable parameters and up to 66% less training memory usage.
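The deep-layer prompting idea described above can be illustrated with a short PyTorch sketch: the pre-trained backbone is frozen, and learnable prompt tokens are prepended only from a chosen layer onward, so gradients (and the activations cached for them) never reach the shallow layers. This is a minimal sketch under stated assumptions, not the paper's implementation; the class name `DeepLayerPrompting`, the prompt count, and the `fusion_start` split point are all illustrative.

```python
import torch
import torch.nn as nn


class DeepLayerPrompting(nn.Module):
    """Sketch: learnable prompts prepended only at the deep layers of a
    frozen transformer encoder. All names and sizes are assumptions."""

    def __init__(self, layers, hidden_dim=768, num_prompts=4, fusion_start=8):
        super().__init__()
        self.layers = nn.ModuleList(layers)  # pre-trained transformer blocks
        for p in self.layers.parameters():
            p.requires_grad = False          # backbone stays frozen
        self.fusion_start = fusion_start     # first layer that receives prompts
        # one trainable prompt set per deep layer; these are the only new params
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(num_prompts, hidden_dim) * 0.02)
             for _ in range(len(self.layers) - fusion_start)]
        )

    def forward(self, x):                    # x: (batch, seq_len, hidden_dim)
        for i, layer in enumerate(self.layers):
            if i >= self.fusion_start:
                p = self.prompts[i - self.fusion_start]
                p = p.unsqueeze(0).expand(x.size(0), -1, -1)
                x = torch.cat([p, x], dim=1)  # prepend this layer's prompts
                x = layer(x)
                x = x[:, p.size(1):]          # drop prompt outputs before next layer
            else:
                x = layer(x)                  # shallow layers run prompt-free
        return x


# Usage sketch: a 12-layer encoder with prompts on the last 4 layers.
blocks = [nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
          for _ in range(12)]
model = DeepLayerPrompting(blocks, fusion_start=8)
out = model(torch.randn(2, 16, 768))         # -> shape (2, 16, 768)
```

Because no tensor requires gradients below `fusion_start`, backpropagation effectively stops at that layer, which is the intuition behind the reported training-memory savings.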