Time-dependent data-generating distributions have proven difficult for gradient-based training of neural networks, as the greedy updates result in catastrophic forgetting of previously learned knowledge. Despite progress in the field of continual learning to overcome this forgetting, we show that a set of common state-of-the-art methods still suffers from substantial forgetting upon starting to learn new tasks, except that this forgetting is temporary and followed by a phase of performance recovery. We refer to this intriguing but potentially problematic phenomenon as the stability gap. The stability gap likely remained under the radar due to the standard practice in the field of evaluating continual learning models only after each task. Instead, we establish a framework for continual evaluation that uses per-iteration evaluation, and we define a new set of metrics to quantify worst-case performance. Empirically, we show that experience replay, constraint-based replay, knowledge distillation, and parameter regularization methods are all prone to the stability gap, and that the stability gap can be observed in class-, task-, and domain-incremental learning benchmarks. Additionally, a controlled experiment shows that the stability gap increases when tasks are more dissimilar. Finally, by disentangling gradients into plasticity and stability components, we propose a conceptual explanation for the stability gap.
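The per-iteration evaluation protocol described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: all names (`continual_evaluation`, `train_step`, `evaluate`, the scripted accuracy schedule) are assumptions, and one worst-case metric (the minimum accuracy observed per task over all iterations) stands in for the paper's full metric set.

```python
# Hypothetical sketch: evaluate on every task seen so far after EACH
# training iteration, rather than only at task boundaries, and track a
# worst-case metric (per-task minimum accuracy). Names are illustrative.

def continual_evaluation(tasks, model, train_step, evaluate):
    """Run per-iteration evaluation over a task sequence."""
    min_acc = {}   # task id -> lowest accuracy seen at any iteration
    history = []   # per-iteration accuracy snapshots
    for t, task_data in enumerate(tasks):
        for batch in task_data:
            train_step(model, batch)
            # Evaluate on all tasks seen so far, every iteration.
            accs = {k: evaluate(model, k) for k in range(t + 1)}
            for k, a in accs.items():
                min_acc[k] = min(min_acc.get(k, a), a)
            history.append(accs)
    return min_acc, history

# Scripted accuracies for illustration: task 0 dips when task 1 starts
# (iteration 3) and then partially recovers -- the stability-gap pattern.
_sched = {0: iter([0.60, 0.90, 0.40, 0.80]), 1: iter([0.50, 0.70])}
min_acc, history = continual_evaluation(
    tasks=[[None, None], [None, None]],         # two tasks, two batches each
    model=None,
    train_step=lambda model, batch: None,       # dummy update
    evaluate=lambda model, k: next(_sched[k]),  # scripted accuracies
)
```

Task-boundary evaluation would only see task 0 at accuracies 0.90 and 0.80 and miss the transient drop to 0.40, which is exactly what the worst-case metric `min_acc` captures.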
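The gradient disentanglement mentioned at the end can be illustrated with a toy model. This is a minimal pure-Python sketch under simplifying assumptions (a scalar logistic-regression weight, made-up data, equal-sized new-task and replay batches), not the paper's analysis: here the "plasticity" component is the gradient of the loss on the new task and the "stability" component is the gradient on replayed old-task examples.

```python
import math

# Hypothetical sketch: disentangle the update of a replay-trained
# logistic-regression model into a plasticity gradient (new-task loss)
# and a stability gradient (loss on replayed old-task data).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_grad(w, xs, ys):
    """Gradient of mean binary cross-entropy w.r.t. scalar weight w."""
    return sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.3
new_x, new_y = [1.0, -0.5, 2.0, 0.4], [1, 0, 1, 1]   # current task
old_x, old_y = [0.8, -1.2, 0.1, -0.3], [0, 1, 0, 1]  # replay buffer

g_plasticity = bce_grad(w, new_x, new_y)  # drives learning the new task
g_stability = bce_grad(w, old_x, old_y)   # protects old-task performance
g_total = bce_grad(w, new_x + old_x, new_y + old_y)

# With equal-sized batches, the joint-batch gradient is the average of
# the two components; in this toy data they point in opposite directions,
# so a step that helps the new task hurts the old one.
assert abs(g_total - 0.5 * (g_plasticity + g_stability)) < 1e-12
```

When the plasticity component dominates early in a new task, the combined update initially degrades old-task performance, consistent with the transient drop the abstract describes.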