Neural network quantization is frequently used to optimize model size, latency, and power consumption for on-device deployment of neural networks. In many cases, a target bit-width is set for an entire network, meaning every layer gets quantized to the same number of bits. However, in many networks some layers are significantly more robust to quantization noise than others, leaving an important axis of improvement unused. Since many hardware solutions provide multiple bit-width settings, mixed-precision quantization has emerged as a promising way to find a better performance-efficiency trade-off than homogeneous quantization. However, most existing mixed-precision algorithms are difficult for practitioners to use, as they require access to the training data, have many hyper-parameters to tune, or even depend on end-to-end retraining of the entire model. In this work, we present a simple post-training mixed-precision algorithm that needs only a small unlabeled calibration dataset to automatically select suitable bit-widths for each layer for desirable on-device performance. Our algorithm requires no hyper-parameter tuning, is robust to data variation, and takes practical hardware deployment constraints into account, making it a strong candidate for practical use. We experimentally validate our proposed method on several computer vision tasks, natural language processing tasks, and many different networks, and show that we can find mixed-precision networks that provide a better trade-off between accuracy and efficiency than their homogeneous bit-width equivalents.
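To make the idea of calibration-based per-layer bit-width selection concrete, the sketch below shows one simple (hypothetical) way such a selection could be implemented; it is not the paper's actual algorithm. It assumes symmetric uniform weight quantization, a made-up loss tolerance `tol`, a candidate bit-width set `candidate_bits`, and a user-supplied `calib_loss` callable that evaluates a (partially quantized) model on a small unlabeled calibration set.

```python
import numpy as np


def fake_quantize(w: np.ndarray, bits: int) -> np.ndarray:
    """Symmetric uniform fake-quantization of a weight tensor to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = float(np.max(np.abs(w)))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale


def select_bitwidths(weights, calib_loss, candidate_bits=(4, 6, 8), tol=0.01):
    """Greedy per-layer bit-width selection (illustrative sketch only).

    weights        : dict of layer name -> fp32 weight array
    calib_loss     : callable(dict) -> float, loss of the model with the given
                     weights measured on a small unlabeled calibration set
    candidate_bits : bit-widths supported by the target hardware (assumed)
    tol            : maximum tolerated loss increase per layer (assumed)
    """
    base = calib_loss(weights)
    chosen = {}
    for name, w in weights.items():
        chosen[name] = max(candidate_bits)  # fall back to the widest setting
        for bits in sorted(candidate_bits):  # try the lowest bit-width first
            trial = dict(weights)
            trial[name] = fake_quantize(w, bits)
            if calib_loss(trial) - base <= tol:
                chosen[name] = bits  # keep the cheapest setting within tolerance
                break
    return chosen
```

In this sketch, each layer is quantized in isolation and assigned the lowest candidate bit-width whose calibration loss stays within the tolerance, which mirrors the high-level goal of exploiting per-layer robustness to quantization noise; a real deployment-aware method would additionally account for interactions between layers and hardware cost constraints.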