Neural network pruning is useful for discovering efficient, high-performing subnetworks within pre-trained, dense network architectures. However, it typically involves a three-step process--pre-training, pruning, and re-training--that is computationally expensive because the dense model must be fully pre-trained. Fortunately, several works have empirically shown that high-performing subnetworks can be discovered via pruning without fully pre-training the dense network. Aiming to theoretically analyze the amount of dense network pre-training needed for a pruned network to perform well, we derive a theoretical bound on the number of SGD pre-training iterations of a two-layer, fully-connected network, beyond which pruning via greedy forward selection yields a subnetwork that achieves low training error. This threshold depends logarithmically on the size of the dataset, meaning that experiments with larger datasets require more pre-training for subnetworks obtained via pruning to perform well. We empirically demonstrate the validity of our theoretical results across a variety of architectures and datasets, including fully-connected networks trained on MNIST and several deep convolutional neural network (CNN) architectures trained on CIFAR10 and ImageNet.
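To make the pruning procedure concrete, below is a minimal NumPy sketch of greedy forward selection applied to a pre-trained two-layer ReLU network. The weight matrix `W`, output coefficients `a`, the mean-squared training objective, and the averaged-subnetwork form are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import numpy as np

def greedy_forward_selection(W, a, X, y, k):
    """Greedily select k neurons from a pre-trained two-layer ReLU network
    f(x) = sum_i a[i] * relu(W[i] @ x), at each step adding the neuron whose
    inclusion most reduces the training loss of the averaged subnetwork.
    (Illustrative sketch; objective and network form are assumptions.)"""
    # Per-neuron contributions a_i * relu(w_i^T x), shape (n_samples, n_neurons).
    H = np.maximum(X @ W.T, 0.0) * a
    selected, running_sum = [], np.zeros(X.shape[0])
    for _ in range(k):
        # Candidate subnetwork outputs: average of selected contributions plus
        # each candidate neuron (neurons may be selected more than once).
        preds = (running_sum[:, None] + H) / (len(selected) + 1)
        losses = np.mean((preds - y[:, None]) ** 2, axis=0)
        best = int(np.argmin(losses))
        selected.append(best)
        running_sum += H[:, best]
    return selected

# Example: prune a randomly initialized 100-neuron network to 10 neurons on toy data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(256, 20)), rng.normal(size=256)
W, a = rng.normal(size=(100, 20)), rng.normal(size=100)
print(greedy_forward_selection(W, a, X, y, k=10))
```

In the setting analyzed in the paper, a loop of this kind is run on a dense network that has received only a limited number of SGD pre-training iterations, and the question is how many such iterations suffice for the selected subnetwork to reach low training error.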