Graphics processing units (GPUs) are now considered the leading hardware for accelerating general-purpose workloads such as AI, data analytics, and HPC. Over the last decade, researchers have focused on demystifying and evaluating the microarchitectural features of various GPU architectures beyond what vendors reveal. This line of work is necessary to understand the hardware better and to build more efficient workloads and applications. Many works have studied recent Nvidia architectures, such as Volta and Turing, and compared them to their successor, Ampere. However, some microarchitectural features, such as the clock cycles required by different instructions, have not been extensively studied for the Ampere architecture. In this paper, we study the clock cycles per instruction for the various data types found in the instruction-set architecture (ISA) of Nvidia GPUs. Using microbenchmarks, we measure the clock cycles for PTX ISA instructions and their SASS ISA counterparts. We further calculate the clock cycles needed to access each memory unit. We also demystify the new version of the tensor core unit found in the Ampere architecture by using the WMMA API and measuring its clock cycles per instruction and throughput for different data types and input shapes. The results of this work should guide software developers and hardware architects. Furthermore, clock cycles per instruction are widely used by performance modeling simulators and tools to model and predict hardware performance.