Continual learning is a setting in which machine learning models learn novel concepts from continuously shifting training data while avoiding degradation of knowledge on previously seen classes that may disappear from the training data for extended periods of time (a phenomenon known as catastrophic forgetting). Current approaches for continual learning of a single expanding task (i.e., class-incremental continual learning) require extensive rehearsal of previously seen data to avoid this degradation of knowledge. Unfortunately, rehearsal comes at a memory cost and may also violate data privacy. Instead, we explore combining knowledge distillation and parameter regularization in new ways to achieve strong continual learning performance without rehearsal. Specifically, we take a deep dive into common continual learning techniques: prediction distillation, feature distillation, L2 parameter regularization, and EWC parameter regularization. We first disprove the common assumption that parameter regularization techniques fail for rehearsal-free continual learning of a single, expanding task. Next, we explore how to leverage knowledge from a pre-trained model in rehearsal-free continual learning and find that vanilla L2 parameter regularization outperforms EWC parameter regularization and feature distillation. Finally, we evaluate on the recently popular ImageNet-R benchmark and show that L2 parameter regularization applied to the self-attention blocks of a ViT outperforms recently popular prompting-based continual learning methods.
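To make the contrast between the two parameter-regularization baselines concrete, the following PyTorch-style sketch illustrates a uniform L2 penalty versus a Fisher-weighted EWC penalty, both measuring deviation from the parameters saved after the previous task. This is a minimal illustration under assumed names (`old_params`, `fisher`, `lam` are hypothetical), not the paper's implementation.

```python
import torch


def l2_param_penalty(model, old_params, lam):
    """Uniform L2 parameter regularization: penalize squared deviation of
    every parameter from its value after the previous task."""
    penalty = torch.tensor(0.0)
    for name, p in model.named_parameters():
        penalty = penalty + ((p - old_params[name]) ** 2).sum()
    return lam * penalty


def ewc_param_penalty(model, old_params, fisher, lam):
    """EWC-style regularization: the same squared deviation, but weighted
    per-parameter by an estimate of its Fisher information (importance)
    computed on the previous task's data."""
    penalty = torch.tensor(0.0)
    for name, p in model.named_parameters():
        penalty = penalty + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return lam * penalty
```

In training, either penalty would simply be added to the current task's loss; restricting the uniform L2 penalty to the parameters of a ViT's self-attention blocks corresponds to the variant highlighted in the abstract.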