Neural network compression techniques have become increasingly popular as they can drastically reduce the storage and computation requirements for very large networks. Recent empirical studies have illustrated that even simple pruning strategies can be surprisingly effective, and several theoretical studies have shown that compressible networks (in specific senses) should achieve a low generalization error. Yet, a theoretical characterization of the underlying cause that makes networks amenable to such simple compression schemes is still missing. In this study, we address this fundamental question and reveal that the dynamics of the training algorithm play a key role in obtaining such compressible networks. Focusing on stochastic gradient descent (SGD), we link compressibility to two recently established properties of SGD: (i) as the network size goes to infinity, the system can converge to a mean-field limit, where the network weights behave independently; (ii) for a large step-size/batch-size ratio, the SGD iterates can converge to a heavy-tailed stationary distribution. When these two phenomena occur simultaneously, we prove that the networks are guaranteed to be '$\ell_p$-compressible', and that the compression errors of different pruning techniques (magnitude, singular value, or node pruning) become arbitrarily small as the network size increases. We further prove generalization bounds adapted to our theoretical framework, which confirm that the generalization error will be lower for more compressible networks. Our theory and numerical study on various neural networks show that large step-size/batch-size ratios induce heavy tails, which, in combination with overparametrization, result in compressibility.
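For concreteness, one standard way to formalize '$\ell_p$-compressibility' in this literature (a hedged paraphrase of the usual definition, not necessarily the paper's exact statement; the notation $\mathbf{w}^{(n)}_{[k]}$ and $\kappa$ are introduced here for illustration) is via a vanishing relative best-$k$-term approximation error:
$$
\frac{\bigl\|\mathbf{w}^{(n)} - \mathbf{w}^{(n)}_{[\lceil \kappa n \rceil]}\bigr\|_p}{\bigl\|\mathbf{w}^{(n)}\bigr\|_p} \;\xrightarrow[n \to \infty]{}\; 0 \qquad \text{for every } \kappa \in (0,1),
$$
where $\mathbf{w}^{(n)} \in \mathbb{R}^n$ collects the network weights and $\mathbf{w}^{(n)}_{[k]}$ keeps the $k$ largest-magnitude entries of $\mathbf{w}^{(n)}$ while zeroing the rest. Under this reading, magnitude pruning at any fixed keep ratio incurs negligible relative error once the network is large enough.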
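The following is a minimal numerical sketch (not from the paper) of the core claim that heavy-tailed weights are magnitude-prunable. It draws i.i.d. weights from a symmetrized Pareto distribution with tail index $1.5$, an illustrative stand-in for the heavy-tailed stationary distributions the abstract describes, prunes all but the largest 10% of entries, and compares the relative $\ell_2$ error against a Gaussian baseline of the same dimension. All function names and parameters here are illustrative choices, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def magnitude_prune(w, keep_ratio):
    """Zero out all but the ceil(keep_ratio * n) largest-magnitude entries of w."""
    k = max(1, int(np.ceil(keep_ratio * w.size)))
    pruned = np.zeros_like(w)
    idx = np.argsort(np.abs(w))[-k:]   # indices of the k largest magnitudes
    pruned[idx] = w[idx]
    return pruned

def relative_error(w, keep_ratio, p=2):
    """Relative l_p pruning error: ||w - w_pruned||_p / ||w||_p."""
    return np.linalg.norm(w - magnitude_prune(w, keep_ratio), p) / np.linalg.norm(w, p)

n = 100_000
# Heavy-tailed weights: symmetrized Pareto with tail index ~1.5 (infinite variance),
# a stand-in for the heavy-tailed SGD stationary distributions in the abstract.
heavy = rng.pareto(1.5, size=n) * rng.choice([-1.0, 1.0], size=n)
# Light-tailed baseline: i.i.d. Gaussian weights of the same dimension.
light = rng.standard_normal(n)

for name, w in [("heavy-tailed", heavy), ("Gaussian", light)]:
    print(f"{name:12s}: relative l2 error keeping top 10% = {relative_error(w, 0.10):.3f}")
```

For the Gaussian baseline the relative error concentrates near $\sqrt{0.56} \approx 0.75$ (the bottom 90% of magnitudes carry about 56% of the $\ell_2$ energy, a standard Gaussian tail computation), whereas the heavy-tailed vector's error is far smaller and shrinks as $n$ grows, since with an infinite second moment the few largest entries dominate the norm. This mirrors the abstract's claim that pruning errors become arbitrarily small as the network size increases.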