To better understand catastrophic forgetting, we study fitting an overparameterized linear model to a sequence of tasks with different input distributions. We analyze how much the model forgets the true labels of earlier tasks after training on subsequent tasks, obtaining exact expressions and bounds. We establish connections between continual learning in the linear setting and two other research areas: alternating projections and the Kaczmarz method. In specific settings, we highlight differences between forgetting and convergence to the offline solution as studied in those areas. In particular, when T tasks in d dimensions are presented cyclically for k iterations, we prove an upper bound of T^2 * min{1/sqrt(k), d/k} on the forgetting. This stands in contrast to the convergence to the offline solution, which can be arbitrarily slow according to existing alternating projection results. We further show that the T^2 factor can be lifted when tasks are presented in a random ordering.
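
To make the setting concrete, below is a minimal NumPy sketch of the continual linear-regression setup described above, under illustrative assumptions not taken from the paper: isotropic Gaussian task inputs, realizable labels from a shared ground-truth model, and full convergence on each task, so that each task update is an orthogonal projection onto that task's solution set (the alternating-projections / Kaczmarz view). All dimensions, sample sizes, and the forgetting metric shown are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, n_per_task, k = 50, 5, 10, 100     # dimension, tasks, samples per task, cycles (illustrative)

w_star = rng.standard_normal(d)          # shared ground-truth linear model
tasks = []
for _ in range(T):
    X = rng.standard_normal((n_per_task, d))   # task-specific inputs (isotropic here for simplicity)
    tasks.append((X, X @ w_star))               # realizable labels y = X w*

def project(w, X, y):
    """Orthogonally project w onto the affine set {v : X v = y}, i.e. the
    minimum-distance interpolator of the current task reachable from w."""
    r = y - X @ w
    return w + X.T @ np.linalg.lstsq(X @ X.T, r, rcond=None)[0]

def forgetting(w, tasks):
    """Average squared error on the true labels of all tasks after training."""
    return np.mean([np.mean((X @ w - y) ** 2) for X, y in tasks])

w = np.zeros(d)
for cycle in range(k):                   # cyclic task ordering
    for X, y in tasks:
        w = project(w, X, y)             # train current task to zero loss
    if (cycle + 1) % 20 == 0:
        print(f"cycle {cycle + 1:3d}: forgetting = {forgetting(w, tasks):.3e}")
```

Running this sketch, the forgetting decays over cycles, which is the quantity the stated T^2 * min{1/sqrt(k), d/k} bound controls; replacing the cyclic loop with a random task ordering illustrates the setting in which the T^2 factor can be removed.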