While neural networks have advanced the frontiers in many applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is key if we want to integrate modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings, but the additional noise it induces can lead to accuracy degradation. In this white paper, we introduce state-of-the-art algorithms for mitigating the impact of quantization noise on the network's performance while maintaining low-bit weights and activations. We start with a hardware-motivated introduction to quantization and then consider two main classes of algorithms: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). PTQ requires no re-training or labeled data and is thus a lightweight push-button approach to quantization. In most cases, PTQ is sufficient for achieving 8-bit quantization with close to floating-point accuracy. QAT requires fine-tuning and access to labeled training data but enables lower-bit quantization with competitive results. For both solutions, we provide tested pipelines based on existing literature and extensive experimentation that lead to state-of-the-art performance for common deep learning models and tasks.
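As a rough illustration of the quantization noise mentioned above, the following minimal sketch (not taken from this paper) applies a standard uniform affine 8-bit scheme to a tensor in NumPy; the function names and the example tensor are made up for illustration, and the scale/zero-point computation shown is one common asymmetric variant rather than the specific pipeline described later in the paper.

```python
import numpy as np

def quantize_uniform_affine(x, num_bits=8):
    """Quantize a float tensor to unsigned integers with a uniform affine scheme."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Scale maps the float range onto the integer grid; the zero-point aligns 0.0 with an integer.
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = np.round(qmin - x.min() / scale)
    x_int = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return x_int.astype(np.uint8), scale, zero_point

def dequantize(x_int, scale, zero_point):
    """Map integers back to floats; the gap to the original values is the quantization noise."""
    return scale * (x_int.astype(np.float32) - zero_point)

# Example: 8-bit quantization of a made-up weight tensor.
w = np.random.randn(64).astype(np.float32)
w_int, scale, zp = quantize_uniform_affine(w, num_bits=8)
w_hat = dequantize(w_int, scale, zp)
print("max quantization error:", np.abs(w - w_hat).max())
```

At 8 bits this rounding error is typically small enough that PTQ alone preserves near floating-point accuracy; at lower bit-widths the noise grows, which is where QAT's fine-tuning becomes necessary.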