FP8 versus INT8 for efficient deep learning inference

Mart van Baalen, Andrey Kuzmin, Suparna S Nair, Yuwei Ren, Eric Mahurin, Chirag Patel, Sundar Subramanian, Sanghyuk Lee, Markus Nagel, Joseph Soriaga, Tijmen Blankevoort

Recently, the idea of using FP8 as a number format for neural network training has been floating around the deep learning world. Given that most training is currently conducted with entire networks in FP32, or sometimes FP16 with mixed precision, running some parts of a network in FP8 with 8-bit weights is an appealing potential speed-up for the generally costly and time-intensive training procedures in deep learning. A natural question arises regarding what this development means for efficient inference on edge devices. In the efficient inference device world, workloads are frequently executed in INT8, sometimes going even as low as INT4 when efficiency calls for it. In this whitepaper, we compare the performance of the FP8 and INT formats for efficient on-device inference. We theoretically show the difference between the INT and FP formats for neural networks and present a plethora of post-training quantization and quantization-aware training results to show how this theory translates to practice. We also provide a hardware analysis showing that the FP formats are somewhere between 50-180% less efficient in terms of compute in dedicated hardware than the INT format. Based on our research and a reading of the research field, we conclude that although the proposed FP8 format could be good for training, the results for inference do not warrant a dedicated implementation of FP8 in favor of INT8 for efficient inference. We show that our results are mostly consistent with previous findings but that important comparisons between the formats have thus far been lacking. Finally, we discuss what happens when FP8-trained networks are converted to INT8, and we conclude with a brief discussion of the most efficient way to deploy on-device and an extensive suite of INT8 results for many models.
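To make the difference between the INT and FP number formats concrete, the sketch below quantizes two synthetic tensors with a uniform INT8 grid and with a simplified FP8 grid, then compares the mean squared error. It is only an illustration and not the paper's experimental setup. Several things here are assumptions: the helper names (`quantize_int8`, `fp8_grid`, `quantize_fp8`, `mse`) are hypothetical, scaling is per-tensor by the maximum absolute value, and the FP8 grid uses 4 exponent bits and 3 mantissa bits with every encoding treated as a finite number, so no bit patterns are reserved for NaN or infinity as a hardware format would do.

```python
import numpy as np


def quantize_int8(x):
    """Symmetric per-tensor INT8: a uniform grid scaled by max |x| (assumed scheme)."""
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale


def fp8_grid(exp_bits=4, man_bits=3):
    """Non-negative values of a simplified FP8 format (no NaN/inf encodings reserved)."""
    bias = 2 ** (exp_bits - 1) - 1
    values = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:
                # Subnormal numbers: no implicit leading one.
                values.append(frac * 2.0 ** (1 - bias))
            else:
                # Normal numbers: implicit leading one.
                values.append((1.0 + frac) * 2.0 ** (e - bias))
    return np.array(sorted(set(values)))


def quantize_fp8(x, exp_bits=4, man_bits=3):
    """Scale max |x| onto the largest grid value, then round to the nearest grid point."""
    grid = fp8_grid(exp_bits, man_bits)
    scale = np.max(np.abs(x)) / grid[-1]
    mag = np.abs(x) / scale
    # Index of the first grid value >= mag, clipped so idx - 1 and idx are both valid.
    idx = np.clip(np.searchsorted(grid, mag), 1, len(grid) - 1)
    lower, upper = grid[idx - 1], grid[idx]
    nearest = np.where(mag - lower <= upper - mag, lower, upper)
    return np.sign(x) * nearest * scale


def mse(x, xq):
    return float(np.mean((x - xq) ** 2))


rng = np.random.default_rng(0)
gaussian = rng.standard_normal(100_000)

# Same data with a few large outliers that stretch the quantization range.
with_outliers = gaussian.copy()
with_outliers[:10] = 200.0

for name, data in [("gaussian", gaussian), ("with_outliers", with_outliers)]:
    print(name, "INT8 MSE:", mse(data, quantize_int8(data)),
          "FP8 MSE:", mse(data, quantize_fp8(data)))
```

In this toy setting, the uniform INT8 grid gives the lower error on the well-behaved Gaussian tensor, while the exponent bits of the FP8 grid help once a few outliers stretch the range. Real networks, calibration methods, and hardware formats differ from this sketch, so it should be read as intuition only.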
