Motivated by the observation that stochastic gradient descent (SGD) often finds a flat minimum valley in the training loss, we propose a novel directional pruning method that searches for a sparse minimizer in or close to that flat region. The proposed pruning method requires neither retraining nor expert knowledge of the target sparsity level. To overcome the computational infeasibility of estimating the flat directions directly, we propose a carefully tuned $\ell_1$ proximal gradient algorithm that provably achieves the directional pruning with a small learning rate after sufficient training. The empirical results show that our solution performs favorably in the highly sparse regime (92% sparsity) compared with many existing pruning methods on ResNet50 with ImageNet, while using only slightly more wall time and memory than SGD. Using VGG16 and the wide ResNet 28x10 on CIFAR-10 and CIFAR-100, we demonstrate that our solution reaches the same valley of minima as SGD, and that the minima found by our solution and by SGD do not deviate in directions that affect the training loss. The code that reproduces the results of this paper is available at https://github.com/donlan2710/gRDA-Optimizer/tree/master/directional_pruning.
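For concreteness, the sketch below illustrates a generic $\ell_1$ proximal gradient (soft-thresholding) update of the kind referred to above. It is a minimal NumPy version under simplifying assumptions and does not reproduce the paper's tuned gRDA schedule; the function names, step size, and the toy quadratic example are illustrative only.

```python
import numpy as np

def soft_threshold(w, tau):
    # Proximal operator of tau * ||w||_1: shrink each coordinate toward zero.
    return np.sign(w) * np.maximum(np.abs(w) - tau, 0.0)

def l1_proximal_gradient_step(w, grad, lr, lam):
    # One proximal gradient update: a gradient step on the smooth loss,
    # followed by soft-thresholding, which sets small coordinates exactly
    # to zero and thus induces sparsity without a separate retraining phase.
    return soft_threshold(w - lr * grad, lr * lam)

# Toy usage (hypothetical data): minimize 0.5 * ||A w - b||^2 + lam * ||w||_1
rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
w, lr, lam = np.zeros(20), 5e-3, 0.5
for _ in range(2000):
    grad = A.T @ (A @ w - b)               # gradient of the smooth part
    w = l1_proximal_gradient_step(w, grad, lr, lam)
print("fraction of exactly-zero weights:", np.mean(w == 0.0))
```

In the paper's setting, the analogous shrinkage is applied to the network weights during training, so that coordinates driven to zero correspond to pruned connections.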