This paper presents a practical methodology for collecting the performance data needed to conduct hierarchical Roofline analysis on NVIDIA GPUs. It describes an extension of the Empirical Roofline Toolkit (ERT) that supports a broader range of data precisions as well as Tensor Cores. It also introduces an Nsight Compute-based method for accurately collecting application performance information. Together, these enable automated machine characterization and application characterization for Roofline analysis across the entire memory hierarchy of NVIDIA GPUs. The methodology is validated on a complex deep learning application for climate image segmentation. We use two versions of the code, written in TensorFlow and PyTorch respectively, to demonstrate the use and effectiveness of this methodology. We highlight how the application uses the compute and memory capabilities of the GPU, and how its implementation and performance differ between the two deep learning frameworks.
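As a reminder of what a hierarchical Roofline combines, the sketch below applies the standard Roofline bound, attainable performance = min(compute peak, arithmetic intensity × bandwidth), separately at each memory level. Arithmetic intensity at a level is the kernel's FLOP count divided by the bytes moved at that level. In the methodology above, the peaks would come from machine characterization (ERT) and the FLOP and byte counts from application characterization (Nsight Compute). The names `MemoryLevel` and `hierarchical_roofline` are hypothetical, and every number is an illustrative assumption rather than a measurement from this work.

```python
from dataclasses import dataclass


@dataclass
class MemoryLevel:
    """One level of the GPU memory hierarchy (hypothetical helper type)."""
    name: str
    peak_bandwidth_gbs: float  # peak bandwidth in GB/s (machine characterization)
    bytes_moved: float         # bytes moved at this level (application characterization)


def hierarchical_roofline(flops, seconds, peak_gflops, levels):
    """Return achieved GFLOP/s and, per level, (arithmetic intensity, ceiling in GFLOP/s)."""
    achieved_gflops = flops / seconds / 1e9
    per_level = {}
    for level in levels:
        # FLOPs per byte measured at this particular memory level.
        intensity = flops / level.bytes_moved
        # FLOP/B * GB/s = GFLOP/s; performance is capped by the compute peak.
        ceiling = min(peak_gflops, intensity * level.peak_bandwidth_gbs)
        per_level[level.name] = (intensity, ceiling)
    return achieved_gflops, per_level


# Hypothetical inputs for illustration only (not results from this paper).
levels = [
    MemoryLevel("L1", peak_bandwidth_gbs=14000.0, bytes_moved=2e11),
    MemoryLevel("L2", peak_bandwidth_gbs=2500.0, bytes_moved=5e10),
    MemoryLevel("HBM", peak_bandwidth_gbs=900.0, bytes_moved=2e10),
]
achieved, per_level = hierarchical_roofline(
    flops=2e12, seconds=0.1, peak_gflops=125000.0, levels=levels
)

# The level with the lowest ceiling is the tightest bound on this kernel.
binding = min(per_level, key=lambda name: per_level[name][1])

print(f"achieved: {achieved:.0f} GFLOP/s")
for name, (ai, ceil) in per_level.items():
    print(f"{name}: AI={ai:.1f} FLOP/B, ceiling={ceil:.0f} GFLOP/s")
print("tightest ceiling:", binding)
```

With these made-up inputs, HBM gives the lowest ceiling. Comparing the achieved rate against each level's ceiling is what indicates which part of the memory hierarchy limits the kernel.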