Tensor Cores have been an important unit for accelerating fused Matrix Multiply-Accumulate (MMA) operations in NVIDIA GPUs since the Volta architecture. To program Tensor Cores, users must use either the legacy wmma APIs or the current mma APIs. The legacy wmma APIs are easier to use but expose only a limited subset of Tensor Core features: they support fewer operand shapes and cannot leverage the new sparse matrix multiplication feature of the latest Ampere Tensor Cores. However, the performance of the current programming interface has not been well explored. Furthermore, the numeric behaviors of the low-precision floating-point formats (TF32, BF16, and FP16) supported by the latest Ampere Tensor Cores also remain undocumented. In this paper, we explore the throughput and latency of the current programming APIs. We also study the numeric behaviors of Tensor Core MMA and profile the intermediate operations, including the multiplications and additions of the inner product and the addition of the accumulation.
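To illustrate the legacy interface discussed above, the following is a minimal sketch of the wmma API: a single warp computes one 16x16x16 FP16 MMA tile, D = A * B + C, with FP32 accumulation. The kernel name and the assumption of contiguous 16x16 tiles are illustrative; the wmma fragment types and intrinsics themselves are part of CUDA's `mma.h`.

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes D = A * B + 0 for a single 16x16x16 tile.
// a, b: FP16 input tiles; d: FP32 output tile (leading dimension 16 assumed).
__global__ void wmma_gemm_tile(const half *a, const half *b, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);    // initialize accumulator C = 0
    wmma::load_matrix_sync(a_frag, a, 16);  // collaborative warp-wide load
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // Tensor Core MMA
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}
```

The current mma APIs, by contrast, are inline-PTX-level instructions (e.g. `mma.sync.aligned.m16n8k16`) in which the programmer controls the per-thread register layout of each operand directly, which is what exposes the additional shapes and the Ampere sparse MMA path.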