As Deep Neural Networks (DNNs) are usually overparameterized and contain millions of weight parameters, deploying these large DNN models on resource-constrained hardware platforms, e.g., smartphones, is challenging. Numerous network compression methods such as pruning and quantization have been proposed to reduce the model size significantly, of which the key is to find a suitable compression allocation (e.g., pruning sparsity and quantization codebook) for each layer. Existing solutions obtain the compression allocation in an iterative/manual fashion while finetuning the compressed model, and thus suffer from efficiency issues. Different from the prior art, we propose a novel One-shot Pruning-Quantization (OPQ) method in this paper, which analytically solves the compression allocation using the pre-trained weight parameters only. During finetuning, the compression module is fixed and only the weight parameters are updated. To our knowledge, OPQ is the first work to reveal that a pre-trained model is sufficient for solving pruning and quantization simultaneously, without any complex iterative/manual optimization at the finetuning stage. Furthermore, we propose a unified channel-wise quantization method that enforces all channels of each layer to share a common codebook, which leads to low bit-rate allocation without introducing the extra overhead of traditional channel-wise quantization. Comprehensive experiments on ImageNet with AlexNet/MobileNet-V1/ResNet-50 show that our method improves accuracy and training efficiency while obtaining significantly higher compression rates than the state-of-the-art.
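To make the shared-codebook idea concrete, the following is a minimal sketch of quantizing one layer with a single codebook shared by all of its channels, so only one codebook per layer needs to be stored. It is not the paper's analytical allocation procedure; the plain k-means codebook construction and the function names are illustrative assumptions, not taken from OPQ.

```python
import numpy as np

def build_shared_codebook(weights: np.ndarray, n_bits: int, n_iters: int = 20) -> np.ndarray:
    """Fit ONE codebook (2**n_bits codewords) over all channels of a layer
    using plain k-means on the flattened weights (illustrative, not OPQ's method)."""
    flat = weights.reshape(-1)
    # Initialize codewords uniformly over the weight range.
    codebook = np.linspace(flat.min(), flat.max(), 2 ** n_bits)
    for _ in range(n_iters):
        # Assign every weight to its nearest codeword.
        assign = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
        # Move each codeword to the mean of its assigned weights.
        for k in range(len(codebook)):
            members = flat[assign == k]
            if members.size:
                codebook[k] = members.mean()
    return codebook

def quantize_with_codebook(weights: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Replace every weight by its nearest codeword; the same codebook is shared
    by all channels of the layer, avoiding per-channel codebook overhead."""
    flat = weights.reshape(-1)
    idx = np.abs(flat[:, None] - codebook[None, :]).argmin(axis=1)
    return codebook[idx].reshape(weights.shape)

# Example: 4-bit quantization of a conv layer with 64 output channels (hypothetical shapes).
layer_w = np.random.randn(64, 32, 3, 3).astype(np.float32)
cb = build_shared_codebook(layer_w, n_bits=4)
layer_wq = quantize_with_codebook(layer_w, cb)
```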