Several recent works [40, 24] observed an interesting phenomenon in neural network pruning: a larger finetuning learning rate can significantly improve the final performance. Unfortunately, the reason behind it has remained elusive to date. This paper aims to explain it through the lens of dynamical isometry [42]. Specifically, we examine neural network pruning from an unusual perspective, pruning as initialization for finetuning, and ask whether the inherited weights serve as a good initialization for finetuning. The insights from dynamical isometry suggest a negative answer. Despite its critical role, this issue has not been well recognized by the community so far. In this paper, we show that understanding this problem is very important: on top of explaining the aforementioned mystery about the larger finetuning learning rate, it also unveils the mystery about the value of pruning [5, 30]. Besides a clearer theoretical understanding of pruning, resolving the problem also brings considerable performance benefits in practice.
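For reference, the sketch below recalls the standard formulation of dynamical isometry [42] that our argument relies on; the notation (a network f with input-output Jacobian J) is our own shorthand rather than anything specific to the cited work.

```latex
% Dynamical isometry (standard formulation): the singular values of the
% network's input-output Jacobian should concentrate around 1, so that
% signals and gradients are neither amplified nor attenuated across layers.
\[
    J = \frac{\partial f(\mathbf{x})}{\partial \mathbf{x}}, \qquad
    \sigma_i(J) \approx 1 \quad \text{for all } i .
\]
```

Intuitively, when pruning removes weights, the singular value spectrum of J can drift away from 1, which is why the inherited weights may be a poor starting point for finetuning.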