This paper studies a new optimization algorithm for training deep learning models with a fixed classification-network architecture in a continual learning framework. The training data is non-stationary, and the non-stationarity is imposed by a sequence of distinct tasks. We first analyze a deep model trained on a single learning task in isolation and identify a region in network parameter space where the model performance is close to the recovered optimum. We provide empirical evidence that this region resembles a cone that expands along the convergence direction. We study the principal directions of the optimizer's trajectory after convergence and show that traveling along a few top principal directions can quickly bring the parameters outside the cone, whereas traveling along the remaining directions does not. We argue that catastrophic forgetting in a continual learning setting can be alleviated when the parameters are constrained to stay within the intersection of the plausible cones of the individual tasks encountered so far during training. Based on this observation, we present our direction-constrained optimization (DCO) method, where for each task we introduce a linear autoencoder to approximate its corresponding top forbidden principal directions. These directions are then incorporated into the loss function as a regularization term so that subsequent tasks can be learned without forgetting. Furthermore, to control memory growth as the number of tasks increases, we propose a memory-efficient version of our algorithm called compressed DCO (DCO-COMP), which allocates a memory of fixed size for storing all autoencoders. We empirically demonstrate that our algorithm performs favorably compared to other state-of-the-art regularization-based continual learning methods.
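As a rough illustration of the idea described above, the following NumPy sketch extracts the top principal directions of an optimizer trajectory and penalizes parameter movement along them. It is not the paper's implementation and rests on several assumptions: the directions are computed with an SVD instead of a trained linear autoencoder (a linear autoencoder trained on reconstruction error is known to span the same subspace as the top principal components), the penalty is a simple quadratic in the projected displacement, and the names `top_directions`, `direction_penalty`, `direction_penalty_grad`, and `strength` are hypothetical.

```python
import numpy as np


def top_directions(trajectory, k):
    """Top-k principal directions of an optimizer trajectory.

    trajectory: array of shape (num_steps, num_params), one flattened
    parameter vector per row (assumed to be collected after convergence).
    Returns an array of shape (num_params, k) with orthonormal columns.
    """
    centered = trajectory - trajectory.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, sorted by singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T


def direction_penalty(theta, theta_star, directions, strength=1.0):
    """Quadratic penalty on displacement along the 'forbidden' directions.

    This form is an assumption made for illustration: it is zero when
    theta - theta_star is orthogonal to every stored direction.
    """
    coords = directions.T @ (theta - theta_star)
    return strength * float(coords @ coords)


def direction_penalty_grad(theta, theta_star, directions, strength=1.0):
    """Gradient of direction_penalty with respect to theta."""
    coords = directions.T @ (theta - theta_star)
    return 2.0 * strength * (directions @ coords)


# Usage on synthetic data: the trajectory varies mostly along axis 0,
# so moving along axis 0 is penalized far more than moving along axis 4.
rng = np.random.default_rng(0)
scales = np.array([10.0, 1.0, 0.1, 0.1, 0.1])
trajectory = rng.normal(size=(200, 5)) * scales
theta_star = trajectory.mean(axis=0)
forbidden = top_directions(trajectory, k=1)

step_along_top = theta_star + np.eye(5)[0]
step_along_rest = theta_star + np.eye(5)[4]
penalty_top = direction_penalty(step_along_top, theta_star, forbidden)
penalty_rest = direction_penalty(step_along_rest, theta_star, forbidden)
```

In a continual learning loop, one such penalty per previously seen task would be added to the loss of the current task, so that updates are steered away from the directions that most quickly leave each earlier task's low-loss region.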