A growing body of research in continual learning is devoted to overcoming the "catastrophic forgetting" of neural networks by designing new algorithms that are more robust to distribution shifts. While recent progress in the continual learning literature is encouraging, our understanding of which properties of neural networks contribute to catastrophic forgetting is still limited. To address this, instead of focusing on continual learning algorithms, in this work we focus on the model itself and study the impact of the "width" of the neural network architecture on catastrophic forgetting, showing that width has a surprisingly significant effect on forgetting. To explain this effect, we study the learning dynamics of the network from various perspectives, such as gradient norm and sparsity, orthogonalization, and the lazy training regime. We provide potential explanations that are consistent with the empirical results across different architectures and continual learning benchmarks.