Tensor Cores have been an important unit for accelerating fused Matrix Multiply-Accumulate (MMA) operations in all NVIDIA GPUs since the Volta architecture. To program Tensor Cores, users must use either the legacy wmma APIs or the current mma APIs. The legacy wmma APIs are easier to use but can exploit only a limited subset of Tensor Core features and performance. Specifically, the wmma APIs support fewer operand shapes and cannot leverage the new sparse matrix multiplication feature of the latest Ampere Tensor Cores. However, the performance of the current programming interface has not been well explored. Furthermore, the numeric behaviors of the low-precision floating-point formats (TF32, BF16, and FP16) supported by the latest Ampere Tensor Cores also remain poorly documented. In this paper, we explore the throughput and latency of the current programming APIs. We also study the numeric behaviors of Tensor Core MMA and profile the intermediate operations, including multiplication, inner-product addition, and accumulation. All code used in this work can be found at https://github.com/sunlex0717/DissectingTensorCores.
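For context, the legacy wmma path referred to above is the CUDA C++ `nvcuda::wmma` interface. A minimal sketch (not from the paper; assumes one warp computing a single 16x16x16 FP16 tile with FP32 accumulation, on sm_70 or newer) looks like this:

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp multiplies a 16x16 FP16 tile of A by a 16x16 FP16 tile of B,
// accumulating into a 16x16 FP32 tile of C. All pointers reference
// device memory; the leading dimension of every tile is 16.
__global__ void wmma_tile_16x16x16(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);           // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);       // collective warp-wide load
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // D = A*B + C on Tensor Cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

The current mma path, by contrast, is exposed through inline PTX `mma.sync` instructions, which give the programmer explicit control over per-thread register fragments and enable the extra shapes and sparsity features the wmma interface cannot express.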