Model quantization reduces the model size and inference latency of deep neural networks. Mixed-precision quantization is favorable on customized hardware that supports arithmetic operations at multiple bit-widths, enabling maximum efficiency. We propose a novel learning-based algorithm to derive mixed-precision models end-to-end under target computation constraints and model sizes. During optimization, the bit-width of each layer/kernel in the model sits at a fractional status between two consecutive integer bit-widths and can be adjusted gradually. With a differentiable regularization term, the resource constraints are met during quantization-aware training, which results in an optimized mixed-precision model. Furthermore, our method can be naturally combined with channel pruning for better allocation of computation cost. Our final models achieve comparable or better performance than previous mixed-precision quantization methods on MobileNet V1/V2 and ResNet-18 under different resource constraints on the ImageNet dataset.
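To make the fractional bit-width idea concrete, below is a minimal sketch in PyTorch, not the paper's exact formulation: a learnable real-valued bit-width b is realized as a linear interpolation between uniform quantizers at the two neighboring integer bit-widths floor(b) and floor(b)+1, and a differentiable penalty pushes the total weighted bit cost toward a budget. All names here (`uniform_quantize`, `FracQuant`, `bit_penalty`) and the specific uniform quantizer with a straight-through estimator are illustrative assumptions.

```python
# Illustrative sketch of fractional bit-width quantization with a
# differentiable resource penalty. Names and quantizer details are
# assumptions, not the paper's exact formulation.
import torch
import torch.nn as nn

def uniform_quantize(x, bits):
    """Uniformly quantize x in [-1, 1] to an integer bit-width,
    using a straight-through estimator for the rounding gradient."""
    levels = 2 ** bits - 1
    x = torch.clamp(x, -1.0, 1.0)
    scaled = (x + 1.0) / 2.0 * levels
    q = torch.round(scaled)
    q = scaled + (q - scaled).detach()  # straight-through estimator
    return q / levels * 2.0 - 1.0

class FracQuant(nn.Module):
    """Quantizer whose bit-width is a learnable fractional value:
    the output interpolates between the two neighboring integer
    bit-widths, so the bit-width can be adjusted gradually."""
    def __init__(self, init_bits=6.0):
        super().__init__()
        self.bits = nn.Parameter(torch.tensor(init_bits))

    def forward(self, x):
        b = torch.clamp(self.bits, 2.0, 8.0)
        lo = torch.floor(b).detach()
        frac = b - lo  # gradient w.r.t. the fractional bit-width
        q_lo = uniform_quantize(x, int(lo.item()))
        q_hi = uniform_quantize(x, int(lo.item()) + 1)
        return (1.0 - frac) * q_lo + frac * q_hi

def bit_penalty(quantizers, costs, budget):
    """Differentiable regularizer: penalize the total per-layer
    cost-weighted bit-width (e.g., weighted by MACs or #params)
    exceeding the target budget."""
    total = sum(q.bits * c for q, c in zip(quantizers, costs))
    return torch.relu(total - budget)
```

In training, a term like `bit_penalty` would be added to the task loss with a multiplier, so gradients simultaneously push expensive layers toward lower bit-widths and preserve accuracy; after convergence, each fractional bit-width is rounded to a fixed integer for deployment.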