The few-shot learning ability of vision transformers (ViTs) is rarely investigated, though heavily desired. In this work, we empirically find that, with the same few-shot learning frameworks, \eg~Meta-Baseline, replacing the widely used CNN feature extractor with a ViT model often severely impairs few-shot classification performance. Moreover, our empirical study shows that in the absence of inductive bias, ViTs often learn low-quality token dependencies under the few-shot learning regime, where only a few labeled training data are available, and this largely contributes to the above performance degradation. To alleviate this issue, for the first time, we propose a simple yet effective few-shot training framework for ViTs, namely Self-promoted sUpervisioN (SUN). Specifically, besides the conventional global supervision for global semantic learning, SUN further pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token. This location-specific supervision tells the ViT which patch tokens are similar or dissimilar, and thus accelerates token dependency learning. Moreover, it models the local semantics in each patch token to improve object grounding and recognition capability, which helps learn generalizable patterns. To improve the quality of location-specific supervision, we further propose two techniques:~1) background patch filtration, which filters out background patches and assigns them to an extra background class; and 2) spatial-consistent augmentation, which introduces sufficient diversity for data augmentation while preserving the accuracy of the generated local supervision. Experimental results show that SUN with ViTs significantly surpasses other few-shot learning frameworks with ViTs and is the first to achieve higher performance than CNN-based state-of-the-arts.
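For concreteness, the training objective described above can be sketched as follows; the notation (student ViT predictions $f_s$, pretrained teacher ViT predictions $f_t$, patch count $N$, and weight $\lambda$) is ours and serves only as a hedged illustration of the idea, not the paper's exact formulation:
\[
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{global}}\!\big(f_s^{\mathrm{cls}}(x),\, y\big) \;+\; \lambda\,\frac{1}{N}\sum_{i=1}^{N} \mathcal{L}_{\mathrm{local}}\!\big(f_s^{(i)}(x),\, \tilde{y}^{(i)}\big),
\qquad
\tilde{y}^{(i)} =
\begin{cases}
\text{background class}, & \text{if patch } i \text{ is filtered out as background},\\[2pt]
f_t^{(i)}(x), & \text{otherwise},
\end{cases}
\]
where $f_s^{\mathrm{cls}}$ denotes the student's global (class-token) prediction supervised by the image label $y$, $f_s^{(i)}$ its prediction for patch token $i$, and $f_t^{(i)}$ the location-specific supervision generated by the pretrained ViT; spatial-consistent augmentation keeps the spatial correspondence between $f_s^{(i)}$ and $\tilde{y}^{(i)}$ intact under data augmentation.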