Despite advances in understanding lazy training, recent work attributes the practical success of deep learning to the rich regime with its complex inductive bias. In this paper, we study rich-regime training empirically on benchmark datasets, and find that while most parameters are lazy, there is always a small set of active parameters that change considerably during training. We show that re-initializing the active parameters (resetting them to their initial random values) leads to worse generalization. Further, we show that most of the active parameters are in the bottom layers, close to the input, especially as the networks become wider. Based on these observations, we study static Layer-Wise Sparse (LWS) SGD, which updates only a subset of layers. We find that updating only the top and bottom layers yields good generalization and, as expected, updating only the top layers yields a fast algorithm. Inspired by this, we investigate probabilistic LWS-SGD, which mostly updates the top layers and occasionally updates the full network. We show that probabilistic LWS-SGD matches the generalization performance of vanilla SGD while the back-propagation time can be 2-5 times more efficient.
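A minimal sketch of a probabilistic layer-wise sparse update step in the spirit described above, assuming the network is split into a bottom block and a top block; the split point, the module names, and the probability `p_full` are illustrative assumptions, not the paper's exact procedure. With probability `p_full` the full network is updated, otherwise only the top block, so back-propagation can stop at the frozen boundary.

```python
import random

import torch
import torch.nn as nn


def lws_sgd_step(model_bottom: nn.Module,
                 model_top: nn.Module,
                 loss_fn,
                 batch,
                 optimizer: torch.optim.Optimizer,
                 p_full: float = 0.2) -> float:
    """One probabilistic layer-wise sparse SGD step (illustrative sketch).

    The optimizer is assumed to hold the parameters of both blocks.
    """
    x, y = batch
    full_update = random.random() < p_full

    # On "top-only" steps, freeze the bottom block. Since neither the input
    # nor the bottom parameters require gradients, autograd stops at the
    # block boundary, which is where the back-propagation savings come from.
    for param in model_bottom.parameters():
        param.requires_grad_(full_update)

    # set_to_none=True ensures frozen parameters keep grad=None and are
    # skipped by optimizer.step().
    optimizer.zero_grad(set_to_none=True)
    loss = loss_fn(model_top(model_bottom(x)), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch, the occasional full-network steps keep the bottom (active) layers training, while the frequent top-only steps avoid back-propagating through the bottom block at all, which is one plausible way to realize the 2-5x back-propagation speedup mentioned above.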