References
1. Much more on Deep Learning’s Size Problem.
2. A common example of this is XOR, which can theoretically be represented with two hidden neurons but in practice requires around twenty to be learned reliably (a minimal sketch of this appears after these notes).
3. Kukačka, Jan, Vladimir Golkov, and Daniel Cremers. 2017. “Regularization for Deep Learning: A Taxonomy.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1710.10686.
4. Zhang, Chiyuan, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. 2016. “Understanding Deep Learning Requires Rethinking Generalization.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1611.03530.
5. Du, Simon S., Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. 2018. “Gradient Descent Finds Global Minima of Deep Neural Networks.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1811.03804.
6. Haeffele, Benjamin D., and René Vidal. 2017. “Global Optimality in Neural Network Training.” In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7331–39.
7. And it’s very active. I’ve seen a bunch of papers (that I haven’t read) improving on these types of bounds.
8. Theoretically, though, we at least know that training even a three-neuron neural network is NP-hard. There are similar negative results for other specific tasks and architectures. There might even be a proof that over-parameterization is necessary and sufficient for successful training. You might be interested in this similar, foundational work.
9. Frankle, Jonathan, Gintare Karolina Dziugaite, Daniel M. Roy, and Michael Carbin. 2019. “Linear Mode Connectivity and the Lottery Ticket Hypothesis.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1912.05671.
10. Zhou (2019) explores this idea with more detailed experiments. Liu et al. (2018) found similar results for structured pruning (convolution channels, etc.) instead of weight pruning. They, however, could randomly re-initialize the structurally pruned networks and train them just as well as the unpruned networks (a sketch of the two retraining protocols appears after these notes). The difference between these results remains unexplained.
11. Dettmers, Tim, and Luke Zettlemoyer. 2019. “Sparse Networks from Scratch: Faster Training without Losing Performance.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1907.04840.
12. Mostafa, Hesham, and Xin Wang. 2019. “Parameter Efficient Training of Deep Convolutional Neural Networks by Dynamic Sparse Reparameterization.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1902.05967.
13. Evci, Utku, Trevor Gale, Jacob Menick, Pablo Samuel Castro, and Erich Elsen. 2019. “Rigging the Lottery: Making All Tickets Winners.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1911.11134.
14. Lee, Namhoon, Thalaiyasingam Ajanthan, and Philip H. S. Torr. 2018. “SNIP: Single-Shot Network Pruning Based on Connection Sensitivity.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/1810.02340.
15. More work is being done on deciding whether lottery tickets are a general phenomenon.
16. Note that model compression is not the only path to memory-efficient training. For example, gradient checkpointing lets you trade computation time for memory when computing gradients during backprop (see the sketch after these notes).
17. I would say pruning and weight sharing are almost fully explored at this point, while quantization, factorization, and knowledge distillation have the biggest opportunity for improvements.
18. Gale, Trevor, Erich Elsen, and Sara Hooker. 2019. “The State of Sparsity in Deep Neural Networks.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1902.09574.
19. What type of regularization induces these 0 weights? It’s not entirely clear. Haeffele and Vidal (2017, note 6 above) proved that when a certain class of neural networks achieves a global optimum, the parameters of some sub-network become 0. If training implicitly or explicitly prefers L0-regularized solutions, then the weights will also be sparse.
20. Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. “ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations.” arXiv [cs.CL]. arXiv. http://arxiv.org/abs/1909.11942.
21. Here’s a survey. Other examples include Q-BERT and Bitwise Neural Networks.
22. Note that quantized networks need special hardware to really see gains, which might explain why quantization is less popular than some of the other methods (a dynamic-quantization sketch appears after these notes).
23. inFERENCe has some thoughts about this from the Bayesian perspective. In short, flat minima (which may or may not lead to generalization) should have parameters with a low minimum-description length. Another explanation is that networks that are robust to noise generalize better, and round-off error can be thought of as a type of regularization.
24. Rastegari, Mohammad, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. “XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/1603.05279.
25. Zhou, Shuchang, Zekun Ni, Xinyu Zhou, He Wen, Yuxin Wu, and Yuheng Zou. 2016. “DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients.” https://www.semanticscholar.org/paper/8b053389eb8c18c61b84d7e59a95cb7e13f205b7.
26. Lin, Xiaofan, Cong Zhao, and Wei Pan. 2017. “Towards Accurate Binary Convolutional Neural Network.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1711.11294.
27. Wang, Ziheng, Jeremy Wohlwend, and Tao Lei. 2019. “Structured Pruning of Large Language Models.” arXiv [cs.CL]. arXiv. http://arxiv.org/abs/1910.04732.
28. Denton, Emily, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. 2014. “Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation.” arXiv [cs.CV]. arXiv. http://arxiv.org/abs/1404.0736.
29. Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. 2015. “Distilling the Knowledge in a Neural Network.” arXiv [stat.ML]. arXiv. http://arxiv.org/abs/1503.02531.
30. Kim, Yoon, and Alexander M. Rush. 2016. “Sequence-Level Knowledge Distillation.” arXiv [cs.CL]. arXiv. http://arxiv.org/abs/1606.07947.
31. Furlanello, Tommaso, Zachary C. Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. 2018. “Born Again Neural Networks.” arXiv [stat.ML]. arXiv. http://arxiv.org/abs/1805.04770.
32. Yang, Chenglin, Lingxi Xie, Chi Su, and Alan L. Yuille. 2018. “Snapshot Distillation: Teacher-Student Optimization in One Generation.” https://www.semanticscholar.org/paper/a167d8a4ee261540c2b709dde2d94572c6ea3fc8.
33. Chen, Defang, Jian-Ping Mei, Can Wang, Yan Feng, and Chun Chen. 2019. “Online Knowledge Distillation with Diverse Peers.” arXiv [cs.LG]. arXiv. http://arxiv.org/abs/1912.00350.
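
To make note 2 concrete, here is a minimal sketch (assuming PyTorch, with hyperparameters chosen purely for illustration) of the XOR trainability gap: a 2-hidden-unit ReLU network can represent XOR, but training it from a random initialization often fails, while a 20-hidden-unit network trains reliably.

```python
import torch
import torch.nn as nn

# The four XOR points and their labels.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = torch.tensor([[0.], [1.], [1.], [0.]])

def xor_success_rate(hidden, seeds=20, steps=2000):
    """Fraction of random inits that learn XOR exactly with `hidden` units."""
    successes = 0
    for seed in range(seeds):
        torch.manual_seed(seed)
        model = nn.Sequential(nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        opt = torch.optim.Adam(model.parameters(), lr=0.1)
        for _ in range(steps):
            opt.zero_grad()
            nn.functional.binary_cross_entropy_with_logits(model(X), y).backward()
            opt.step()
        successes += int(((model(X) > 0).float() == y).all())
    return successes / seeds

print("2 hidden units :", xor_success_rate(2))   # often well below 1.0
print("20 hidden units:", xor_success_rate(20))  # typically 1.0
```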
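Note 10 compares two ways of retraining a pruned network. Below is a minimal sketch (assuming PyTorch; the layer size and 80% pruning rate are arbitrary) of the two protocols: magnitude-prune a layer, then either rewind the surviving weights to their saved initialization (the lottery ticket procedure) or re-initialize them randomly (as Liu et al. did for structured pruning).

```python
import copy
import torch
import torch.nn as nn

layer = nn.Linear(100, 100)
init_state = copy.deepcopy(layer.state_dict())  # save the original initialization

# ... imagine `layer` is trained here as part of a larger network ...

# Keep only the largest-magnitude 20% of weights.
flat = layer.weight.detach().abs().flatten()
threshold = flat.kthvalue(int(0.8 * flat.numel())).values
mask = (layer.weight.detach().abs() > threshold).float()

with torch.no_grad():
    # Protocol A: lottery-ticket rewind -- surviving weights return to their init.
    layer.load_state_dict(init_state)
    layer.weight.mul_(mask)

    # Protocol B: random re-initialization of the same pruned structure.
    # nn.init.kaiming_uniform_(layer.weight)
    # layer.weight.mul_(mask)

# Retraining then keeps the mask applied after every optimizer step,
# e.g. layer.weight.data.mul_(mask) inside the training loop.
```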
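For note 16, here is a minimal sketch of gradient checkpointing using PyTorch’s torch.utils.checkpoint (block sizes are arbitrary, and the use_reentrant argument assumes a recent PyTorch release): activations inside each checkpointed block are not stored on the forward pass and are recomputed during backprop, trading extra compute for lower peak memory.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(8)]
)

def forward(x):
    for block in blocks:
        # Each block's activations are recomputed on the backward pass
        # instead of being kept in memory.
        x = checkpoint(block, x, use_reentrant=False)
    return x

x = torch.randn(32, 512, requires_grad=True)
forward(x).sum().backward()  # same gradients, lower peak memory
```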
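And for note 22, a minimal sketch of post-training dynamic quantization with PyTorch’s quantize_dynamic (the layer sizes are placeholders): the Linear weights are stored as int8 and the model still runs on an ordinary CPU, though, as the note says, the larger speedups depend on hardware and kernels with good int8 support.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))

# Convert Linear layers to int8 weights; activations are quantized on the fly.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 768)
print(quantized(x).shape)  # same interface, roughly 4x smaller Linear weights
```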