Few-shot classification (FSC) entails learning novel classes given only a few examples per class, after a pre-training (or meta-training) phase on a set of base classes. Recent works have shown that simply fine-tuning a pre-trained Vision Transformer (ViT) on new test classes is a strong approach for FSC. Fine-tuning ViTs, however, is expensive in time, compute, and storage. This has motivated the design of parameter-efficient fine-tuning (PEFT) methods, which fine-tune only a fraction of the Transformer's parameters. While these methods have shown promise, inconsistencies in experimental conditions make it difficult to disentangle their advantage from other experimental factors, including the feature extractor architecture, pre-trained initialization, and fine-tuning algorithm, amongst others. In our paper, we conduct a large-scale, experimentally consistent, empirical analysis to study PEFT methods for few-shot image classification. Through a battery of over 1.8k controlled experiments on large-scale few-shot benchmarks, including Meta-Dataset (MD) and ORBIT, we uncover novel insights on PEFT methods that shed light on their efficacy in fine-tuning ViTs for few-shot classification. Through our controlled empirical study, we have two main findings: (i) fine-tuning just the LayerNorm parameters during few-shot adaptation (which we call LN-Tune) is an extremely strong baseline across ViTs pre-trained with both self-supervised and supervised objectives; (ii) for self-supervised ViTs, we find that simply learning a set of scaling parameters for each attention matrix (which we call AttnScale), along with a domain-residual adapter (DRA) module, leads to state-of-the-art performance on MD while being $\sim\!$ 9$\times$ more parameter-efficient. Our extensive empirical findings set strong baselines and call for rethinking the current design of PEFT methods for FSC.
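The LN-Tune selection rule above can be sketched as a simple parameter-masking function: freeze every weight in the ViT except the LayerNorm affine parameters. This is a minimal, framework-free illustration, assuming LayerNorm parameters can be identified by a `"norm"` substring in their names (the names below are illustrative of a timm-style ViT, not the paper's exact implementation):

```python
def ln_tune_mask(param_names):
    """Return {name: trainable?} implementing the LN-Tune rule.

    During few-shot adaptation, only LayerNorm affine parameters
    (here identified by the assumed substring "norm") are updated;
    all other ViT parameters stay frozen.
    """
    return {name: ("norm" in name.lower()) for name in param_names}


# Hypothetical ViT-style parameter names for illustration:
names = [
    "blocks.0.attn.qkv.weight",
    "blocks.0.norm1.weight",
    "blocks.0.norm1.bias",
    "blocks.0.mlp.fc1.weight",
]
mask = ln_tune_mask(names)
# Only the two norm1 parameters are marked trainable.
```

In a real framework, this mask would translate to setting `requires_grad` (or the equivalent) per parameter, which is what makes LN-Tune so cheap: the trainable set is a tiny fraction of the full model.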