Despite achieving state-of-the-art zero-shot performance, existing vision-language models still fall short in few-shot transfer ability on domain-specific problems. Classical fine-tuning often fails to prevent highly expressive models from exploiting spurious correlations. Although model-agnostic meta-learning (MAML) presents itself as a natural alternative for few-shot transfer learning, its expensive computation, due to implicit second-order optimization, limits its use on large-scale vision-language models such as CLIP. While much literature has been devoted to exploring alternative optimization strategies, we identify another essential aspect of effective few-shot transfer learning, task sampling, which has previously been viewed only as part of data pre-processing in MAML. To show the impact of task sampling, we propose a simple algorithm, Model-Agnostic Multitask Fine-tuning (MAMF), which differs from classical fine-tuning only in uniformly sampling multiple tasks. Despite its simplicity, we show that MAMF consistently outperforms classical fine-tuning on five few-shot vision-language classification tasks. We further show that, in the context of few-shot vision-language classification, the effectiveness of the bi-level optimization in MAML is highly sensitive to the zero-shot performance of a task. The goal of this paper is to provide new insights into what makes few-shot learning work, and to encourage further research into better task sampling strategies.
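For concreteness, a minimal sketch of the multitask fine-tuning idea described above is given below. It only illustrates the single change relative to classical fine-tuning, namely that each gradient step is taken on a few-shot task sampled uniformly at random, using plain first-order updates rather than MAML's bi-level optimization. The model, task, and loss objects, the `sample_batch` helper, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Sketch of Model-Agnostic Multitask Fine-tuning (MAMF) as described in the abstract:
# classical fine-tuning, except that each update is computed on a uniformly sampled
# few-shot task. No second-order (bi-level) optimization is involved.
import random
import torch

def mamf_finetune(model, tasks, loss_fn, num_steps=100, lr=1e-5):
    """Fine-tune `model` over uniformly sampled few-shot tasks.

    Args:
        model: a pretrained vision-language model (e.g. a CLIP-like classifier).
        tasks: a list of few-shot tasks; each is assumed to expose sample_batch(),
               a hypothetical helper returning an (inputs, labels) batch.
        loss_fn: a standard classification loss such as cross-entropy.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(num_steps):
        task = random.choice(tasks)           # uniform task sampling
        inputs, labels = task.sample_batch()  # few-shot batch from the chosen task
        loss = loss_fn(model(inputs), labels)
        optimizer.zero_grad()
        loss.backward()                       # single-level, first-order update
        optimizer.step()
    return model
```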