While neural networks have advanced the frontiers in many machine learning applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is vital to integrating modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings, but the additional noise it induces can lead to accuracy degradation. In this white paper, we present an overview of neural network quantization using AI Model Efficiency Toolkit (AIMET). AIMET is a library of state-of-the-art quantization and compression algorithms designed to ease the effort required for model optimization and thus drive the broader AI ecosystem towards low latency and energy-efficient inference. AIMET provides users with the ability to simulate as well as optimize PyTorch and TensorFlow models. Specifically for quantization, AIMET includes various post-training quantization (PTQ, cf. chapter 4) and quantization-aware training (QAT, cf. chapter 5) techniques that guarantee near floating-point accuracy for 8-bit fixed-point inference. We provide a practical guide to quantization via AIMET by covering PTQ and QAT workflows, code examples and practical tips that enable users to efficiently and effectively quantize models using AIMET and reap the benefits of low-bit integer inference.
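As a preview of the workflows detailed in later chapters, the following is a minimal sketch of quantization simulation with AIMET's PyTorch API (aimet_torch). The QuantizationSimModel and compute_encodings calls follow AIMET's documented interface; the ResNet18 model and the random calibration batches are illustrative stand-ins, not part of the paper's experiments.

```python
# A minimal sketch of simulated 8-bit quantization with AIMET's PyTorch API.
# Assumes aimet_torch is installed; the calibration data below is random
# stand-in data -- real use would feed a few hundred representative samples.
import torch
from torchvision.models import resnet18

from aimet_common.defs import QuantScheme
from aimet_torch.quantsim import QuantizationSimModel

model = resnet18(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)

# Wrap the FP32 model with quantization ops that simulate 8-bit
# fixed-point weights and activations during the forward pass.
sim = QuantizationSimModel(
    model,
    dummy_input=dummy_input,
    quant_scheme=QuantScheme.post_training_tf_enhanced,
    default_param_bw=8,    # weight bit-width
    default_output_bw=8,   # activation bit-width
)

# Calibrate the quantization ranges (encodings) via forward passes.
def pass_calibration_data(sim_model, _):
    with torch.no_grad():
        for batch in [torch.randn(8, 3, 224, 224) for _ in range(4)]:
            sim_model(batch)

sim.compute_encodings(forward_pass_callback=pass_calibration_data,
                      forward_pass_callback_args=None)

# sim.model can now be evaluated with the usual pipeline to estimate
# on-target INT8 accuracy; sim.export(...) writes the model and encodings.
```

The same simulated model is also the starting point for the PTQ and QAT techniques covered later: PTQ refines the encodings and weights without labeled training, while QAT fine-tunes through the simulated quantization ops.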