This paper studies the use of Vision Transformers (ViT) in class incremental learning. Surprisingly, naively applying ViT to replace convolutional neural networks (CNNs) results in performance degradation. Our analysis reveals three issues with naively using ViT: (a) ViT converges very slowly when the number of classes is small, (b) ViT exhibits more bias towards new classes than CNN-based models, and (c) the proper learning rate for ViT is too low to learn a good classifier. Based on this analysis, we show that these issues can be simply addressed with existing techniques: using a convolutional stem, balanced finetuning to correct bias, and a higher learning rate for the classifier. Our simple solution, named ViTIL (ViT for Incremental Learning), achieves a new state of the art for all three class incremental learning setups by a clear margin, providing a strong baseline for the research community. For instance, on ImageNet-1000, our ViTIL achieves 69.20% top-1 accuracy for the protocol of 500 initial classes with 5 incremental steps (100 new classes each), outperforming LUCIR+DDE by 1.69%. For the more challenging protocol of 10 incremental steps (100 new classes each), our method outperforms PODNet by 7.27% (65.13% vs. 57.86%).