Learning new classes without forgetting is crucial for real-world applications of classification models. Vision Transformers (ViTs) have recently achieved remarkable performance in Class Incremental Learning (CIL). Previous works mainly focus on block design and model expansion for ViTs. However, in this paper we find that when a ViT is trained incrementally, its attention layers gradually lose concentration on local features. We call this interesting phenomenon \emph{Locality Degradation} in ViTs for CIL. Since low-level local information is crucial to the transferability of the representation, it is beneficial to preserve the locality in attention layers. In this paper, we encourage the model to preserve more local information as training proceeds and devise a Locality-Preserved Attention (LPA) layer to emphasize the importance of local features. Specifically, we incorporate local information directly into the vanilla attention and control the initial gradients of the vanilla attention by weighting it with a small initial value. Extensive experiments show that the representations facilitated by LPA capture more low-level general information that is easier to transfer to follow-up tasks. The improved model achieves consistently better performance on CIFAR100 and ImageNet100.
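As a rough illustration of the mechanism sketched above (the notation here is ours and not necessarily the exact formulation given in the method section; the local term $C$ and the scalar $\alpha$ are placeholders), LPA can be viewed as adding a locality term to the attention logits while down-weighting the vanilla term at initialization:
\begin{equation*}
\mathrm{LPA}(X) \;=\; \mathrm{softmax}\!\left(\alpha \cdot \frac{QK^{\top}}{\sqrt{d}} \;+\; C\right) V,
\qquad Q = XW_Q,\quad K = XW_K,\quad V = XW_V,
\end{equation*}
where $C$ encodes local (e.g., neighborhood-restricted) interactions and $\alpha$ is initialized to a small value, so that early gradients flow mainly through the local branch and the global attention is incorporated gradually.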