Post-training dropout-based approaches achieve high sparsity and are well-established means of addressing computational cost and overfitting in neural network architectures. By contrast, pruning at initialization remains far less explored, even though it is more effective at reducing the network's computational cost and handles overfitting just as well as post-training dropout. Motivated by this, the paper presents two approaches to pruning at initialization, with the goal of achieving high sparsity while preserving performance. 1) K-starts begins with k random p-sparse weight matrices at initialization. Over the first few epochs, the network determines the "fittest" of these p-sparse matrices, in an attempt to find a "lottery ticket" p-sparse network. The approach is adapted from the way evolutionary algorithms select the best individual. Depending on the network architecture, the fitness criterion can be the magnitude of the network weights, the magnitude of the gradients accumulated over an epoch, or a combination of both. 2) The dissipating-gradients approach eliminates weights that remain within a fraction of their initial value during the first few epochs. Removing weights in this manner, regardless of their magnitude, best preserves network performance; on the other hand, this approach also takes the most epochs to reach high sparsity. 3) A combination of dissipating gradients and k-starts consistently outperforms either method alone as well as random dropout. The benefits of the proposed pruning approaches are: 1) they require no task-specific knowledge of the classification problem and no tuning of dropout thresholds or regularization parameters, and 2) retraining the model is neither necessary nor does it affect the performance of the p-sparse network.
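As a rough illustration of the k-starts selection step (not the paper's implementation; the function names, the mixing weight alpha, and the toy dimensions are assumptions for the sketch), the following NumPy snippet samples k random p-sparse masks and keeps the one with the highest fitness, scored on weight magnitude and accumulated gradient magnitude:

import numpy as np

def random_sparse_mask(shape, p, rng):
    # Binary mask with a fraction p of entries zeroed out (p-sparse).
    return (rng.random(shape) >= p).astype(np.float32)

def kstarts_select(weights, grad_accum, k=8, p=0.9, alpha=0.5, seed=0):
    # Sample k random p-sparse masks at initialization and return the
    # "fittest" one. Fitness here mixes the magnitude of surviving weights
    # with the magnitude of gradients accumulated over the first epochs;
    # alpha is an illustrative mixing weight, not a value from the paper.
    rng = np.random.default_rng(seed)
    best_mask, best_fit = None, -np.inf
    for _ in range(k):
        mask = random_sparse_mask(weights.shape, p, rng)
        fit = (alpha * np.abs(weights * mask).sum()
               + (1 - alpha) * np.abs(grad_accum * mask).sum())
        if fit > best_fit:
            best_mask, best_fit = mask, fit
    return best_mask

# Toy usage: a 100x100 layer with fake accumulated gradients.
rng = np.random.default_rng(1)
w = rng.normal(size=(100, 100)).astype(np.float32)
g = rng.normal(size=(100, 100)).astype(np.float32)
mask = kstarts_select(w, g, k=8, p=0.9)
print("kept fraction:", mask.mean())  # roughly 0.1 for p = 0.9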
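Similarly, a minimal sketch of the dissipating-gradients criterion, assuming a relative-change threshold eps (the threshold value and helper name are illustrative, not from the paper): a weight is pruned if, after the first few epochs, it has stayed within a fraction eps of its initial value, i.e. the gradients barely moved it.

import numpy as np

def dissipating_gradients_mask(w_init, w_after, eps=0.01):
    # Keep a weight (mask = 1) only if training moved it by more than
    # a fraction eps of its initial value; prune it (mask = 0) otherwise.
    # eps is a hypothetical threshold chosen for this sketch.
    moved = np.abs(w_after - w_init) > eps * np.abs(w_init)
    return moved.astype(np.float32)

# Toy usage: pretend the first epochs nudged only half of the weights.
rng = np.random.default_rng(0)
w0 = rng.normal(size=(100, 100)).astype(np.float32)
w1 = w0.copy()
w1[:50] += rng.normal(scale=0.1, size=(50, 100)).astype(np.float32)
mask = dissipating_gradients_mask(w0, w1, eps=0.01)
print("kept fraction:", mask.mean())  # roughly 0.5 in this toy example

Note that, unlike magnitude-based criteria, this rule can prune large weights and keep small ones: what matters is only how much a weight changed during the first epochs, which matches the abstract's claim that weights are removed "regardless of their magnitude".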