Does the Adam Optimizer Exacerbate Catastrophic Forgetting?

From arXiv: 9 pages of main text, 3 pages of references, and 16 pages of appendices; 6 figures in the main text and 21 figures in the appendices; 6 tables in the appendices. Source code is available at https://github.com/dylanashley/catastrophic-forgetting/tree/arxiv

Catastrophic forgetting remains a severe hindrance to the broad application of artificial neural networks (ANNs), yet it continues to be a poorly understood phenomenon. Despite the extensive amount of work on catastrophic forgetting, we argue that it is still unclear how exactly the phenomenon should be quantified and, moreover, to what degree the choices we make when designing learning systems affect the amount of catastrophic forgetting. We use various testbeds from the reinforcement learning and supervised learning literature to make two points. (1) We provide evidence that the choice of modern gradient-based optimization algorithm used to train an ANN has a significant impact on the amount of catastrophic forgetting, and we show that, surprisingly, in many instances classical algorithms such as vanilla SGD experience less catastrophic forgetting than more modern algorithms such as Adam. (2) We empirically compare four existing metrics for quantifying catastrophic forgetting and show that the degree to which learning systems experience catastrophic forgetting is sufficiently sensitive to the metric used that a change from one principled metric to another is enough to change the conclusions of a study dramatically. Our results suggest that a much more rigorous experimental methodology is required when looking at catastrophic forgetting. Based on our results, we recommend that inter-task forgetting in supervised learning be measured with both retention and relearning metrics concurrently, and that intra-task forgetting in reinforcement learning be measured, at the very least, with pairwise interference.
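
The abstract names retention, relearning, and pairwise interference but does not define them, so the following Python sketch rests on assumed definitions rather than the authors' exact formulations: retention is taken to be performance on the first task after training on a second task, relearning to be the ratio of the time needed to learn the first task initially to the time needed to relearn it, and pairwise interference to be the change in loss on one sample caused by a single update on another. The linear model, the plain SGD update, and all function names are hypothetical and chosen only to keep the example small.

```python
def steps_to_criterion(curve, threshold):
    """Return the 1-based step at which `curve` first reaches `threshold`, or None."""
    for step, value in enumerate(curve, start=1):
        if value >= threshold:
            return step
    return None


def retention(first_task_accuracy_after_second_task):
    """Assumed definition: first-task performance measured after the second task."""
    return first_task_accuracy_after_second_task


def relearning(first_curve, relearn_curve, threshold):
    """Assumed definition: time to learn the task initially divided by time to relearn it.

    Values above 1 mean relearning was faster than the original learning.
    Returns None if either curve never reaches the threshold.
    """
    first = steps_to_criterion(first_curve, threshold)
    again = steps_to_criterion(relearn_curve, threshold)
    if first is None or again is None:
        return None
    return first / again


def predict(weights, features):
    """Prediction of a linear model (hypothetical stand-in for an ANN)."""
    return sum(w * x for w, x in zip(weights, features))


def squared_error(weights, features, target):
    """Half squared error of the linear model on one sample."""
    return 0.5 * (predict(weights, features) - target) ** 2


def sgd_step(weights, features, target, lr):
    """One plain SGD update of the linear model on one sample."""
    error = predict(weights, features) - target
    return [w - lr * error * x for w, x in zip(weights, features)]


def pairwise_interference(weights, sample_a, sample_b, lr):
    """Assumed definition: change in loss on sample_a after one update on sample_b.

    Positive values indicate interference; negative values indicate transfer.
    """
    features_a, target_a = sample_a
    features_b, target_b = sample_b
    before = squared_error(weights, features_a, target_a)
    updated = sgd_step(weights, features_b, target_b, lr)
    after = squared_error(updated, features_a, target_a)
    return after - before


if __name__ == "__main__":
    # Made-up accuracy curves, used only to exercise the functions.
    first_curve = [0.2, 0.5, 0.8, 0.9]
    relearn_curve = [0.6, 0.9]
    print("retention:", retention(0.6))
    print("relearning:", relearning(first_curve, relearn_curve, threshold=0.9))

    weights = [0.0, 0.0]
    conflicting = pairwise_interference(
        weights, ([1.0, 0.0], 1.0), ([1.0, 0.0], -1.0), lr=0.5
    )
    print("pairwise interference:", conflicting)
```

Reporting retention and relearning side by side, as the abstract recommends, matters because the two can disagree: a network may score poorly on the first task immediately after the second yet relearn it much faster than it learned it originally.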
