Pruning schemes have been widely used in practice to reduce the complexity of trained models with a massive number of parameters. Several practical studies have shown that pruning an overparameterized model and then fine-tuning it generalizes well to new samples. Although this pipeline, which we refer to as pruning + fine-tuning, has been extremely successful in lowering the complexity of trained models, very little is known about the theory behind its success. In this paper, we address this issue by investigating the pruning + fine-tuning framework on the overparameterized matrix sensing problem, with the ground truth denoted by $U_\star \in \mathbb{R}^{d \times r}$ and the overparameterized model by $U \in \mathbb{R}^{d \times k}$ with $k \gg r$. We study the approximate local minima of the empirical mean square error, augmented with a smooth version of the group Lasso regularizer $\sum_{i=1}^k \| U e_i \|_2$, and show that pruning the columns with small $\ell_2$ norm yields a solution $U_{\text{prune}}$ which has the minimum number of columns $r$, yet is close to the ground truth in training loss. When the subsequent fine-tuning phase is initialized from $U_{\text{prune}}$, the resulting solution converges linearly to a generalization error of $O(\sqrt{rd/n})$, ignoring lower-order terms, which is statistically optimal. While our analysis provides insights into the role of regularization in pruning, we also show that running gradient descent in the absence of regularization results in models which are not suitable for greedy pruning, i.e., many columns can have $\ell_2$ norm comparable to the maximum column norm. Finally, we extend our results to the training and pruning of two-layer neural networks with quadratic activation functions. Our results provide the first rigorous insights into why greedy pruning + fine-tuning leads to smaller models which also generalize well.
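As a rough illustration (not the authors' implementation), the NumPy sketch below instantiates the pruning + fine-tuning pipeline for the symmetric parameterization $U U^\top$ of matrix sensing: gradient descent on the empirical mean square error plus a smoothed group Lasso penalty on the columns (smoothed here as $\sqrt{\|U e_i\|_2^2 + \epsilon^2}$, an assumed form of the smoothing), greedy pruning of the low-$\ell_2$-norm columns, and regularization-free fine-tuning initialized from $U_{\text{prune}}$. All dimensions, step sizes, thresholds, and the Gaussian measurement model are illustrative choices, not quantities taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, k, n = 20, 2, 8, 1000        # ambient dimension, true rank, overparameterized width, sample size
lam, eps = 0.05, 1e-3              # penalty weight and smoothing constant (illustrative choices)

# Ground truth U_star and Gaussian measurements y_i = <A_i, U_star U_star^T>.
U_star = rng.normal(size=(d, r)) / np.sqrt(d)
A = rng.normal(size=(n, d, d))
y = np.einsum('nij,ij->n', A, U_star @ U_star.T)

def grad(U, lam):
    """Gradient of the empirical MSE plus a smoothed column-wise group Lasso penalty."""
    R = np.einsum('nij,ij->n', A, U @ U.T) - y                        # residuals, shape (n,)
    G = (2.0 / n) * np.einsum('n,nij->ij', R, A + A.transpose(0, 2, 1)) @ U
    col_norms = np.sqrt(np.sum(U ** 2, axis=0) + eps ** 2)            # smoothed ||U e_i||_2
    return G + lam * U / col_norms

# Phase 1: gradient descent on the regularized objective from a small random start.
U = 0.1 * rng.normal(size=(d, k))
for _ in range(2000):
    U -= 0.05 * grad(U, lam)

# Greedy pruning: keep only the columns whose l2 norm is not far below the largest one.
norms = np.linalg.norm(U, axis=0)
U_prune = U[:, norms > 0.1 * norms.max()]
print("columns kept after pruning:", U_prune.shape[1])   # ideally close to r

# Phase 2: fine-tune the pruned factor with plain (unregularized) gradient descent.
for _ in range(1000):
    U_prune -= 0.05 * grad(U_prune, 0.0)

rel_err = (np.linalg.norm(U_prune @ U_prune.T - U_star @ U_star.T)
           / np.linalg.norm(U_star @ U_star.T))
print("relative recovery error after fine-tuning:", rel_err)
```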