This paper studies a novel optimization algorithm for training deep learning models with a fixed classification-network architecture in a continual learning framework, where the training data are non-stationary and the non-stationarity is imposed by a sequence of distinct tasks. This setting implies the existence of a manifold of network parameters that yield good performance on all tasks. Our algorithm is derived from the geometrical properties of this manifold. We first analyze a deep model trained on a single learning task in isolation and identify a region in network parameter space where the model performance remains close to the recovered optimum. We provide empirical evidence that this region resembles a cone that expands along the convergence direction. We study the principal directions of the optimizer's trajectory after convergence and show that traveling along a few top principal directions can quickly bring the parameters outside the cone, whereas this is not the case for the remaining directions. We argue that catastrophic forgetting in a continual learning setting can be alleviated when the parameters are constrained to stay within the intersection of the plausible cones of the individual tasks encountered so far during training. Enforcing this constraint is equivalent to preventing the parameters from moving along the top principal directions of convergence corresponding to past tasks. For each task, we introduce a new linear autoencoder to approximate its top forbidden principal directions; these are then incorporated into the loss function as a regularization term so that subsequent tasks can be learned without forgetting. We empirically demonstrate that our algorithm performs favorably compared to other state-of-the-art regularization-based continual learning methods, including EWC and SI.
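The two components named in the abstract, a linear autoencoder that approximates the top principal directions of the post-convergence trajectory and a penalty on movement along those directions, can be illustrated in a short NumPy sketch. This is a minimal illustration under our own assumptions, not the authors' implementation: `snapshots` denotes flattened parameter vectors recorded along the optimizer trajectory after convergence on a task, the tied-weight form of the autoencoder, the function names, and the hyperparameters are all hypothetical choices made here for brevity.

```python
import numpy as np

def fit_linear_autoencoder(snapshots, k, lr=1e-3, steps=3000, seed=0):
    """Tied-weight linear autoencoder trained by gradient descent on
    (1/n) * || X - X E^T E ||_F^2. At a minimum, the rows of E span the
    top-k principal subspace of the centered snapshot matrix X
    (Baldi & Hornik, 1989). lr/steps are illustrative, not tuned."""
    X = snapshots - snapshots.mean(axis=0, keepdims=True)
    n, d = X.shape
    E = np.random.default_rng(seed).normal(scale=0.1, size=(k, d))
    for _ in range(steps):
        R = X @ E.T @ E - X                          # residual, shape (n, d)
        E -= lr * (2.0 / n) * E @ (X.T @ R + R.T @ X)  # gradient of the loss
    return E

def forbidden_direction_penalty(theta, theta_star, E, lam):
    """Regularization term added to the loss of later tasks:
    lam * || E (theta - theta_star) ||^2. It penalizes motion of the current
    parameters theta along the stored forbidden directions of a past task
    (with converged solution theta_star), while leaving the remaining
    directions unconstrained."""
    proj = E @ (theta - theta_star)
    return lam * float(proj @ proj)
```

One design point this sketch makes concrete: because only the top principal directions need to be forbidden, each past task is summarized by a compact k-by-d encoder rather than the full trajectory of snapshots, and the penalty vanishes for any parameter update orthogonal to the stored subspace, i.e., for moves that stay inside the task's plausible cone.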