Deep learning architectures with a huge number of parameters are often compressed using pruning techniques to ensure computationally efficient inference during deployment. Despite a multitude of empirical advances, there is a lack of theoretical understanding of the effectiveness of different pruning methods. We inspect different pruning techniques under the statistical mechanics formulation of a teacher-student framework and derive their generalization error (GE) bounds. It has been shown that the Determinantal Point Process (DPP) based node pruning method is notably superior to competing approaches when tested on real datasets. Using the GE bounds in the aforementioned setup, we provide theoretical guarantees for these empirical observations. Another consistent finding in the literature is that sparse neural networks (edge pruned) generalize better than dense neural networks (node pruned) for a fixed number of parameters. We use our theoretical setup to prove this finding and show that even the baseline random edge pruning method performs better than the DPP node pruning method. We also validate this empirically on real datasets.
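To make the node-versus-edge pruning comparison concrete, the following minimal NumPy sketch (illustrative only, not the paper's implementation) prunes a single hidden layer two ways under a matched parameter budget: node pruning keeps whole rows of the weight matrix, while random edge pruning keeps the same number of individual weights scattered across all rows. The layer sizes, the budget of 30 retained nodes, and the use of uniform random node selection as a stand-in for DPP sampling are all assumptions made for illustration.

```python
import numpy as np

# Illustrative sketch: compress one hidden layer W (n_hidden x n_input)
# to the same parameter budget via node pruning vs. edge pruning.
rng = np.random.default_rng(0)
n_hidden, n_input = 100, 50
W = rng.standard_normal((n_hidden, n_input))

keep_nodes = 30                      # hypothetical budget: 30 of 100 hidden nodes
budget = keep_nodes * n_input        # equivalent number of retained parameters

# Node pruning: keep entire hidden nodes (rows of W), yielding a smaller
# dense layer. Here nodes are chosen uniformly at random as a stand-in for
# DPP sampling, which would instead favor a diverse subset of rows.
node_idx = rng.choice(n_hidden, size=keep_nodes, replace=False)
W_node_pruned = W[node_idx, :]

# Random edge pruning: keep the same number of parameters, but as individual
# weights scattered over all rows, yielding a sparse matrix of the original shape.
mask = np.zeros(W.size, dtype=bool)
mask[rng.choice(W.size, size=budget, replace=False)] = True
W_edge_pruned = W * mask.reshape(W.shape)

# Both compressed layers retain exactly `budget` nonzero parameters.
print(W_node_pruned.shape, int((W_edge_pruned != 0).sum()))
```

The point of the sketch is only that the two schemes are compared at an identical parameter count; the abstract's claim is that, under the teacher-student GE analysis, the sparse (edge-pruned) layer generalizes better than the dense (node-pruned) one even with this naive random mask.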