Matrix multiplication is a fundamental operation in both the training of neural networks and inference. To accelerate it, Graphics Processing Units (GPUs) implement matrix multiplication in hardware. Owing to their increased throughput over software-based matrix multiplication, these hardware multipliers are increasingly used outside of AI to accelerate various applications in scientific computing. However, matrix multipliers targeted at AI are at present not compliant with IEEE 754 floating-point arithmetic behaviour, and different vendors offer different numerical features. This leads to results that are not reproducible across generations of GPU architectures at the level of the matrix multiply-accumulate instruction. To study the numerical characteristics of matrix multipliers (such as rounding behaviour, accumulator width, normalization points, and extra carry bits), test vectors are typically constructed. Yet these vectors may or may not distinguish between different hardware models, and, due to limited hardware availability, their reliability across many different platforms remains largely untested. We present software models that emulate the inner-product behaviour of the low- and mixed-precision matrix multipliers in the V100, A100, H100, and B200 data center GPUs in most supported input formats of interest to mixed-precision algorithm developers: 8-, 16-, and 19-bit floating point.
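To illustrate the kind of test vector the abstract refers to, the following minimal sketch (not taken from the paper; the construction and variable names are illustrative assumptions) shows how a single dot product can distinguish a narrow accumulator from a wide one. The addend 2^-11 is exactly representable in binary16, but the running sum 1 + 2^-11 is not: under round-to-nearest-even a 16-bit accumulator collapses it back to 1, while a 32-bit accumulator retains the low-order bit.

```python
import numpy as np

# Test vector: inner product of a and b is exactly 1 + 2**-11.
a = np.array([1.0, 2.0**-11], dtype=np.float16)
b = np.array([1.0, 1.0], dtype=np.float16)

# Emulate accumulation in binary16: products are exact, but each
# partial sum is rounded back to 16 bits (round-to-nearest-even).
acc16 = np.float16(0)
for x, y in zip(a, b):
    acc16 = np.float16(acc16 + np.float32(x) * np.float32(y))
# 1 + 2**-11 lies halfway between 1 and 1 + 2**-10; ties-to-even
# rounds it down, so acc16 == 1.0.

# Emulate accumulation in binary32: the sum is held exactly.
acc32 = np.float32(0)
for x, y in zip(a, b):
    acc32 += np.float32(x) * np.float32(y)
# acc32 == 1 + 2**-11, so the two accumulator widths are distinguished.

print(acc16, acc32)
```

Probes for other characteristics (normalization points, extra carry bits) follow the same pattern: choose inputs whose exact result falls on a boundary that the candidate hardware behaviours round or normalize differently.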


