When quantizing neural networks for efficient inference, low-bit integers are the go-to format for efficiency. However, low-bit floating point numbers have an extra degree of freedom, assigning some bits to work on an exponential scale instead. This paper investigates this benefit of the floating point format for neural network inference in depth. We detail the choices that can be made for the FP8 format, including the important choice of the number of bits for the mantissa and exponent, and show analytically in which settings these choices give better performance. Then we show how these findings translate to real networks, provide an efficient implementation for FP8 simulation, and introduce a new algorithm that enables learning both the scale parameters and the number of exponent bits in the FP8 format. Our chief conclusion is that when doing post-training quantization for a wide range of networks, the FP8 format is better than INT8 in terms of accuracy, and the choice of the number of exponent bits is driven by the severity of outliers in the network. We also conduct experiments with quantization-aware training, where the difference between formats disappears as the network is trained to reduce the effect of outliers.
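To make the exponent/mantissa trade-off concrete, the sketch below simulates rounding a tensor onto an FP8 grid for a given bit split. It is a minimal illustration, not the paper's implementation: the function name `fp8_quantize`, the IEEE-style default bias, and the saturation handling are assumptions, and the paper additionally learns the scale/bias and the exponent bit count rather than fixing them.

```python
import numpy as np

def fp8_quantize(x, n_exp=4, n_man=3, bias=None):
    """Simulate rounding to an FP8 format with n_exp exponent bits and
    n_man mantissa bits (1 sign bit; n_exp + n_man should equal 7).

    Illustrative sketch only; the exponent bias defaults to the
    IEEE-style value and no exponent codes are reserved for inf/NaN."""
    if bias is None:
        bias = 2 ** (n_exp - 1) - 1          # IEEE-style exponent bias (assumed)
    max_exp = 2 ** n_exp - 1 - bias          # largest usable exponent
    min_exp = 1 - bias                       # exponent of the smallest normal
    max_val = 2.0 ** max_exp * (2.0 - 2.0 ** -n_man)  # largest magnitude

    sign = np.sign(x)
    mag = np.abs(x).astype(np.float64)
    # Per-element exponent; values below the normal range share min_exp,
    # which reproduces subnormal spacing.
    exp = np.floor(np.log2(np.maximum(mag, 1e-45)))
    exp = np.clip(exp, min_exp, max_exp)
    # Round the mantissa to n_man fractional bits at that exponent.
    step = 2.0 ** (exp - n_man)
    q = np.round(mag / step) * step
    # Saturate anything beyond the largest representable magnitude.
    q = np.minimum(q, max_val)
    return sign * q

x = np.random.randn(8).astype(np.float32)
print(fp8_quantize(x, n_exp=4, n_man=3))   # more mantissa: finer grid, smaller range
print(fp8_quantize(x, n_exp=5, n_man=2))   # more exponent: coarser grid, wider range
```

Comparing the two calls shows the core trade-off discussed in the paper: extra exponent bits extend the dynamic range (helping with outliers) at the cost of precision around typical values.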