We consider the problem of model compression for deep neural networks (DNNs) in the challenging post-training setting, in which we are given an accurate trained model and must compress it without any retraining, based only on a small amount of calibration input data. This problem has become popular in view of emerging software and hardware support for executing models compressed via pruning and/or quantization with speedups, and well-performing solutions have been proposed independently for each compression approach. In this paper, we introduce a new compression framework which covers both weight pruning and quantization in a unified setting, is time- and space-efficient, and considerably improves upon the practical performance of existing post-training methods. At the technical level, our approach is based on the first exact and efficient realization of the classical Optimal Brain Surgeon (OBS) framework of [Hassibi and Stork, 1992] at the scale of modern DNNs, which we further extend to cover weight quantization. This is enabled by a series of algorithmic developments which may be of independent interest. From the practical perspective, our experimental results show that our framework significantly improves upon the compression-accuracy trade-offs of existing post-training methods, and that it can even enable the accurate joint application of both pruning and quantization in a post-training setting.
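For context on the OBS step that the abstract builds on, the following is a minimal NumPy sketch of greedy OBS pruning for a single layer. It assumes the layer-wise inverse Hessian is given; the function name `obs_prune` and the masking of already-pruned indices are illustrative choices, not the paper's time- and space-efficient implementation.

```python
import numpy as np

def obs_prune(w, H_inv, num_prune):
    """Greedily prune `num_prune` weights from one layer via OBS.

    w      : (d,) layer weights.
    H_inv  : (d, d) inverse Hessian of the layer-wise quadratic loss at w
             (assumed symmetric positive definite).

    Each step removes the weight q minimizing the OBS saliency
        rho_q = w_q^2 / (2 [H^-1]_qq)
    and compensates the remaining weights with the closed-form update
        dw = -(w_q / [H^-1]_qq) * H^-1 e_q.
    """
    w, H_inv = w.astype(np.float64), H_inv.copy()
    pruned = np.zeros(w.size, dtype=bool)
    for _ in range(num_prune):
        saliency = w ** 2 / (2.0 * np.diag(H_inv))
        saliency[pruned] = np.inf                # keep pruned weights out
        q = int(np.argmin(saliency))
        w -= (w[q] / H_inv[q, q]) * H_inv[:, q]  # compensating update
        w[q] = 0.0                               # weight q is exactly zero
        # Drop q from the active set by Gaussian elimination on H_inv,
        # so the next step remains exact for the reduced problem.
        H_inv -= np.outer(H_inv[:, q], H_inv[q, :]) / H_inv[q, q]
        H_inv[q, q] = 1.0                        # placeholder; q is masked
        pruned[q] = True
    return w
```

The quantization extension alluded to above can be sketched the same way: instead of setting w_q to zero, w_q is rounded to a point on the quantization grid, with the analogous compensating update ((quant(w_q) - w_q) / H_inv[q, q]) * H_inv[:, q] applied to the remaining weights.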