Neural network quantization is a promising compression technique to reduce memory footprint and save energy consumption, potentially leading to real-time inference. However, there is a performance gap between quantized and full-precision models. To reduce this gap, existing quantization approaches require high-precision INT32 or full-precision multiplication during inference for scaling or dequantization, which introduces a noticeable cost in memory, speed, and required energy. To tackle these issues, we present F8Net, a novel quantization framework consisting only of fixed-point 8-bit multiplication. To derive our method, we first discuss the advantages of fixed-point multiplication with different formats of fixed-point numbers and study the statistical behavior of the associated fixed-point values. Second, based on this statistical and algorithmic analysis, we apply different fixed-point formats to the weights and activations of different layers, and we introduce a novel algorithm that automatically determines the right format for each layer during training. Third, we analyze a previous quantization algorithm, parameterized clipping activation (PACT), and reformulate it using fixed-point arithmetic. Finally, we unify the recently proposed method for quantization fine-tuning with our fixed-point approach to show the potential of our method. We verify F8Net on ImageNet for MobileNet V1/V2 and ResNet18/50. Our approach achieves accuracy comparable to or better than existing quantization techniques that rely on INT32 multiplication or floating-point arithmetic, and even surpasses the full-precision counterparts, achieving state-of-the-art performance.
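To make the core primitive concrete, here is a minimal NumPy sketch (not the authors' implementation) of 8-bit fixed-point quantization and multiplication; the helper name `quantize_fixed_point` and the example fractional lengths are illustrative assumptions. An 8-bit fixed-point number with fractional length FL stores an integer mantissa q representing the value q · 2^(−FL), so the product of two such numbers is again a plain integer whose fractional length is the sum of the operands' fractional lengths, and realigning it to the next layer's format needs only a bit shift rather than an INT32 or floating-point rescaling multiply.

```python
import numpy as np

# Hypothetical helper: round a float array to signed fixed-point with
# `frac_len` fractional bits in a `word_len`-bit word.
def quantize_fixed_point(x, frac_len, word_len=8):
    scale = 2.0 ** frac_len
    qmin, qmax = -(2 ** (word_len - 1)), 2 ** (word_len - 1) - 1
    # Returns the integer mantissa; the represented value is mantissa / scale.
    return np.clip(np.round(x * scale), qmin, qmax).astype(np.int32)

# Example: an activation in an FL=5 format, a weight in an FL=6 format
# (the formats are chosen per layer, as described above).
a = quantize_fixed_point(np.array([0.73]), frac_len=5)
w = quantize_fixed_point(np.array([-0.41]), frac_len=6)

prod = a * w                      # plain integer multiply, no rescaling
value = prod / 2.0 ** (5 + 6)     # fractional lengths simply add: FL = 5 + 6
print(value)                      # ~ 0.73 * -0.41, up to rounding error
```

Because the output format is known ahead of time, aligning the product to the next layer's fractional length is a constant right shift, which is the property that lets the framework avoid high-precision dequantization.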
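For reference, the PACT baseline mentioned above clips activations to a learnable level alpha and quantizes the clipped range uniformly. Below is a minimal NumPy sketch of the original floating-point formulation (Choi et al., 2018); it does not reproduce the paper's fixed-point reformulation, only the baseline being reformulated.

```python
import numpy as np

def pact_quantize(x, alpha, k=8):
    """PACT: clip to [0, alpha] with a learnable alpha, then quantize
    uniformly to k bits. Floating-point reference version."""
    # Equivalent to np.clip(x, 0.0, alpha), written in PACT's smooth form
    # so that the gradient with respect to alpha is well defined.
    y = 0.5 * (np.abs(x) - np.abs(x - alpha) + alpha)
    step = alpha / (2 ** k - 1)          # uniform quantization step
    return np.round(y / step) * step     # dequantized value (float)

x = np.array([-0.3, 0.2, 1.5, 4.0])
print(pact_quantize(x, alpha=2.0, k=8))  # [0. 0.2 1.5 2.0] up to rounding
```

The round-and-rescale step is where the floating-point multiply by alpha / (2^k − 1) appears; per the abstract, the fixed-point reformulation absorbs this scaling into the choice of fixed-point format, so that only 8-bit fixed-point multiplications remain at inference time.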