《有线电视新闻网培训量化》 (Distribution Adaptive INT8 Quantization for Training CNNs)

Researches have demonstrated that low bit-width (e.g., INT8) quantization can be employed to accelerate the inference process. It makes the gradient quantization very promising since the backward propagation requires approximately twice more computation than forward one. Due to the variability and uncertainty of gradient distribution, a lot of methods have been proposed to attain training stability. However, most of them ignore the channel-wise gradient distributions and the impact of gradients with different magnitudes, resulting in the degradation of final accuracy. In this paper, we propose a novel INT8 quantization training framework for convolutional neural network to address the above issues. Specifically, we adopt Gradient Vectorized Quantization to quantize the gradient, based on the observation that layer-wise gradients contain multiple distributions along the channel dimension. Then, Magnitude-aware Clipping Strategy is introduced by taking the magnitudes of gradients into consideration when minimizing the quantization error, and we present a theoretical derivation to solve the quantization parameters of different distributions. Experimental results on broad range of computer vision tasks, such as image classification, object detection and video classification, demonstrate that the proposed Distribution Adaptive INT8 Quantization training method has achieved almost lossless training accuracy for different backbones, including ResNet, MobileNetV2, InceptionV3, VGG and AlexNet, which is superior to the state-of-the-art techniques. Moreover, we further implement the INT8 kernel that can accelerate the training iteration more than 200% under the latest Turing architecture, i.e., our method excels on both training accuracy and speed.

翻译：研究表明,可以使用低位宽度(例如INT8)的量化来加速推断过程。它使得梯度量化非常有希望,因为后向传播需要大约两倍的计算。由于梯度分布的变异性和不确定性,提出了许多方法来实现培训稳定性。然而,其中多数方法忽略了频道偏差梯度分布和不同程度梯度的影响,导致最终准确度的退化。在本文中,我们提议为变动神经网络提供一个新的INT8量化培训框架,以解决上述问题。具体地说,我们采用梯度量化量化以量化梯度的计算比前向的计算高出大约两倍。基于对地层梯度分布的观察,提出了许多方法来实现培训稳定性稳定。随后,磁度感知缩战略通过在最小化误差时将梯度考虑在内,我们提出了一个理论推算,以解不同分布的量化参数。在计算机视野的广域域网域网域中, 实验结果几乎是精度量化梯度技术的量化值,例如图像分类、图解变精度、变精度培训方法,我们最新的变精度、变精度、变精度、变精度、变精度培训方法、变精度、变精度、变精度、变精度、变精度战略、变精度战略、图方法、图、变精度战略、图、变精度战略、变精度、图方法、图解、变精度培训、图解、图解、图解、变精度、图解、图方法、图解、变精度训练、变精度、图解、图解方法、图解、图解、图解、图解方法、图解、图、图、图解、图、图解方法、图、图、图解、图解、图解、图、图、图解、图解、图解、图解、图解、图解、图解、图解、图解、图解、图解、图解、图解、图解、图解、图解、图解、图、图、图解方法、图解方法、图解方法、图解方法、图、图、图、图、图解、图解、图解、图解、图解方法、图解方法、图解方法、图解、图解、图解、图解