Large pre-trained models have proved to be remarkable zero- and (prompt-based) few-shot learners in unimodal vision and language tasks. We propose MAPL, a simple and parameter-efficient method that reuses frozen pre-trained unimodal models and leverages their strong generalization capabilities in multimodal vision-language (VL) settings. MAPL learns a lightweight mapping between the representation spaces of unimodal models using aligned image-text data, and can generalize to unseen VL tasks from just a few in-context examples. The small number of trainable parameters makes MAPL effective at low-data and in-domain learning. Moreover, MAPL's modularity enables easy extension to other pre-trained models. Extensive experiments on several visual question answering and image captioning benchmarks show that MAPL achieves superior or competitive performance compared to similar methods while training orders of magnitude fewer parameters. MAPL can be trained in just a few hours using modest computational resources and public datasets. We release our code and pre-trained model weights at https://github.com/mair-lab/mapl.
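To make the core idea concrete, below is a minimal PyTorch sketch of the kind of lightweight mapping the abstract describes: a small trainable module that projects features from a frozen image encoder into the input embedding space of a frozen language model, with both backbones kept fixed. This is not the released implementation; the dimensions, the use of learnable query tokens, and the Transformer-based design are illustrative assumptions (see the repository linked above for the actual code).

```python
# Illustrative sketch only: a trainable mapper between a frozen vision encoder
# and a frozen language model. All dimensions and hyperparameters are assumed.
import torch
import torch.nn as nn


class Mapper(nn.Module):
    """Maps a sequence of visual features to a short sequence of
    language-model-compatible "visual prompt" embeddings."""

    def __init__(self, vis_dim=768, lm_dim=4096, num_prompt_tokens=32,
                 hidden_dim=256, num_layers=4, num_heads=8):
        super().__init__()
        # Learnable query tokens that attend to the visual features.
        self.queries = nn.Parameter(torch.randn(num_prompt_tokens, hidden_dim) * 0.02)
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)   # down-project visual features
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.lm_proj = nn.Linear(hidden_dim, lm_dim)     # up-project to LM embedding size

    def forward(self, vis_feats):
        # vis_feats: (batch, num_patches, vis_dim) from the frozen vision encoder
        b = vis_feats.size(0)
        queries = self.queries.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([queries, self.vis_proj(vis_feats)], dim=1)
        x = self.encoder(x)
        # Keep only the query positions as the visual prompt for the LM.
        return self.lm_proj(x[:, :queries.size(1)])      # (batch, num_prompt_tokens, lm_dim)


if __name__ == "__main__":
    mapper = Mapper()
    fake_vis = torch.randn(2, 196, 768)   # e.g. ViT patch features
    print(mapper(fake_vis).shape)         # torch.Size([2, 32, 4096])
```

In this sketch, only the mapper's parameters would be trained, e.g. by prepending its output to the frozen language model's token embeddings and optimizing a standard next-token (captioning) loss on aligned image-text pairs, which is consistent with the parameter-efficient setup the abstract describes.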