A striking observation about iterative magnitude pruning (IMP; Frankle et al. 2020) is that, after just a few hundred steps of dense training, the method can find a sparse sub-network that can be trained to the same accuracy as the dense network. However, the same does not hold at step 0, i.e., at random initialization. In this work, we seek to understand how this early phase of pre-training leads to a good initialization for IMP, through the lenses of both the data distribution and the loss landscape geometry. Empirically, we observe that, holding the number of pre-training iterations constant, training on a small fraction of (randomly chosen) data suffices to obtain an equally good initialization for IMP. We additionally observe that, by pre-training only on "easy" training data, we can decrease the number of steps necessary to find a good initialization for IMP compared to training on the full dataset or on a randomly chosen subset. Finally, we identify novel properties of the loss landscape of dense networks that are predictive of IMP performance, showing in particular that more examples being linearly mode connected in the dense network correlates well with good initializations for IMP. Combined, these results provide new insight into the role played by the early phase of training in IMP.
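For concreteness, the procedure studied here can be sketched as follows. This is a minimal illustration of IMP with weight rewinding under assumed PyTorch conventions, not the authors' released code; the `train` callback, the model, and all hyperparameters are hypothetical placeholders.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


def imp_with_rewinding(model, train, rewind_step, steps_per_round,
                       prune_fraction, num_rounds):
    """Sketch of IMP with weight rewinding: pre-train densely to
    `rewind_step`, then repeatedly train, prune the smallest-magnitude
    weights globally, and rewind the surviving weights to the checkpoint.

    `train(model, num_steps=...)` is a hypothetical training loop supplied
    by the caller; `rewind_step=0` corresponds to rewinding to random init.
    """
    # The early dense pre-training phase the abstract analyses
    # (a few hundred steps), followed by saving the rewind checkpoint.
    train(model, num_steps=rewind_step)
    rewind_state = copy.deepcopy(model.state_dict())

    prunable = [(name, m) for name, m in model.named_modules()
                if isinstance(m, (nn.Linear, nn.Conv2d))]

    for _ in range(num_rounds):
        # Train the (possibly already sparse) network to completion.
        train(model, num_steps=steps_per_round)
        # Globally prune the smallest-magnitude remaining weights.
        prune.global_unstructured(
            [(m, "weight") for _, m in prunable],
            pruning_method=prune.L1Unstructured,
            amount=prune_fraction)
        # Rewind surviving weights to the early checkpoint; masks are kept.
        with torch.no_grad():
            for name, m in prunable:
                m.weight_orig.copy_(rewind_state[f"{name}.weight"])
    return model
```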