Model quantization is a widely used technique to compress and accelerate deep neural network (DNN) inference. Emerging DNN hardware accelerators have begun to support flexible bitwidths (1-8 bits) to further improve computation efficiency, which raises a great challenge: finding the optimal bitwidth for each layer requires domain experts to explore a vast design space, trading off among accuracy, latency, power, and model size, which is both time-consuming and sub-optimal. Conventional quantization algorithms ignore the differences among hardware architectures and quantize all layers in a uniform way. In this paper, we introduce the Hardware-Aware Automated Quantization (HAQ) framework, which leverages reinforcement learning to automatically determine the quantization policy, and we take the hardware accelerator's feedback into the design loop. Rather than relying on proxy signals such as FLOPs and model size, we employ a hardware simulator to generate direct feedback signals to the RL agent. Compared with conventional methods, our framework is fully automated and can specialize the quantization policy for different neural network architectures and hardware architectures. Our framework effectively reduces latency by 1.4-1.95x and energy consumption by 1.9x with negligible loss of accuracy compared with fixed-bitwidth (8-bit) quantization. Our framework reveals that the optimal policies on different hardware architectures (i.e., edge and cloud architectures) under different resource constraints (i.e., latency, power, and model size) are drastically different. We interpret the implications of the different quantization policies, which offer insights for both neural network architecture design and hardware architecture design.
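To make the hardware-in-the-loop feedback structure concrete, below is a minimal sketch, not the authors' implementation: a simple random search stands in for the RL agent, and the functions simulated_latency, accuracy_proxy, and search_policy, along with all layer statistics and constants, are hypothetical placeholders for the hardware simulator and evaluation pipeline described in the abstract.

```python
import random

def simulated_latency(bitwidths, layer_macs):
    # Hypothetical latency model standing in for the hardware simulator:
    # assume latency scales roughly linearly with bitwidth per layer.
    return sum(b * macs for b, macs in zip(bitwidths, layer_macs)) * 1e-9

def accuracy_proxy(bitwidths):
    # Placeholder: penalize aggressive quantization below 4 bits.
    # A real setup would quantize the model and evaluate on validation data.
    return 1.0 - sum(max(0, 4 - b) * 0.01 for b in bitwidths)

def search_policy(num_layers, layer_macs, latency_budget, steps=1000):
    """Random search standing in for the RL agent: propose per-layer
    bitwidths (1-8 bits), query the simulated hardware feedback, and
    keep the best policy that satisfies the latency constraint."""
    best_policy, best_acc = None, -1.0
    for _ in range(steps):
        policy = [random.randint(1, 8) for _ in range(num_layers)]
        if simulated_latency(policy, layer_macs) > latency_budget:
            continue  # infeasible under the hardware resource constraint
        acc = accuracy_proxy(policy)
        if acc > best_acc:
            best_policy, best_acc = policy, acc
    return best_policy, best_acc

if __name__ == "__main__":
    macs = [1e6, 4e6, 8e6, 2e6]  # illustrative per-layer MAC counts
    policy, acc = search_policy(4, macs, latency_budget=0.08)
    print("bitwidths:", policy, "proxy accuracy:", acc)
```

Changing the latency budget or the (assumed) per-layer costs shifts which layers can afford higher bitwidths, which mirrors the abstract's observation that the optimal policy differs across hardware architectures and resource constraints.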