Multimodal few-shot learning is challenging due to the large domain gap between the vision and language modalities. Existing methods attempt to communicate visual concepts as prompts to frozen language models, but they rely on hand-engineered task induction to reduce the hypothesis space. To make the whole process learnable, we introduce a multimodal meta-learning approach. Specifically, our approach decomposes the training of the model into a set of related multimodal few-shot tasks. We define a meta-mapper network, acting as a meta-learner, to efficiently bridge frozen large-scale vision and language models and leverage their already-learned capacity. By updating only the learnable parameters of the meta-mapper, the model learns to accrue shared meta-knowledge across these tasks and can therefore rapidly adapt to newly presented samples with only a few gradient updates. Importantly, it induces the task in a completely data-driven manner, with no need for hand-engineered task induction. We evaluate our approach on recently proposed multimodal few-shot benchmarks, measuring how rapidly the model can bind novel visual concepts to words and answer visual questions by observing only a limited set of labeled examples. The experimental results show that our meta-learning approach outperforms the baseline across multiple datasets and various training settings while being computationally more efficient.
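To make the idea concrete, the sketch below illustrates the general pattern the abstract describes: a small mapper network emits soft prompts that bridge a frozen vision encoder and a frozen language model, and only the mapper's parameters are adapted with a few gradient steps on each few-shot task. This is a minimal, first-order illustration; the class and function names, dimensions, and the placeholder loss are assumptions for exposition, not the paper's exact implementation.

```python
import copy
import torch
import torch.nn as nn

class MetaMapper(nn.Module):
    """Maps a frozen visual feature to k prompt vectors in the LM embedding space."""
    def __init__(self, vis_dim=512, lm_dim=768, num_prompts=4):
        super().__init__()
        self.num_prompts, self.lm_dim = num_prompts, lm_dim
        self.proj = nn.Linear(vis_dim, num_prompts * lm_dim)

    def forward(self, vis_feats):                         # (B, vis_dim)
        prompts = self.proj(vis_feats)                     # (B, k * lm_dim)
        return prompts.view(-1, self.num_prompts, self.lm_dim)

def adapt_to_task(mapper, task_loss_fn, support_feats, support_targets,
                  inner_lr=1e-2, inner_steps=3):
    """First-order inner loop: copy the mapper and take a few gradient steps
    on the support set, while the vision and language backbones stay frozen."""
    fast_mapper = copy.deepcopy(mapper)
    opt = torch.optim.SGD(fast_mapper.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        prompts = fast_mapper(support_feats)
        loss = task_loss_fn(prompts, support_targets)      # e.g. LM loss given prompts
        opt.zero_grad()
        loss.backward()
        opt.step()
    return fast_mapper                                      # evaluated on the query set

# Toy usage with random tensors standing in for frozen-encoder features.
if __name__ == "__main__":
    mapper = MetaMapper()
    feats = torch.randn(8, 512)                             # support-set visual features
    targets = torch.randn(8, 4, 768)                        # placeholder prompt targets
    adapted = adapt_to_task(mapper, nn.functional.mse_loss, feats, targets)
```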