MLPerf™ HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems

Steven Farrell, Murali Emani, Jacob Balma, Lukas Drescher, Aleksandr Drozd, Andreas Fink, Geoffrey Fox, David Kanter, Thorsten Kurth, Peter Mattson, Dawei Mu, Amit Ruhela, Kento Sato, Koichi Shirahata, Tsuguchika Tabaru, Aristeidis Tsaris, Jan Balewski, Ben Cumming, Takumi Danjo, Jens Domke, Takaaki Fukai, Naoto Fukumoto, Tatsuya Fukushi, Balazs Gerofi, Takumi Honda, Toshiyuki Imamura, Akihiko Kasagi, Kentaro Kawakami, Shuhei Kudo, Akiyoshi Kuroda, Maxime Martinasso, Satoshi Matsuoka, Henrique Mendonça, Kazuki Minami, Prabhat Ram, Takashi Sawada, Mallikarjun Shankar, Tom St. John, Akihiro Tabuchi, Venkatram Vishwanath, Mohamed Wahib, Masafumi Yamazaki, Junqi Yin

Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand how to benchmark, fairly and effectively, machine learning applications that are representative of real-world scientific use cases. MLPerf™ is a community-driven standard for benchmarking machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications, driven by the MLCommons™ Association. We present the results from the first submission round, which includes a diverse set of some of the world's largest HPC systems. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence, and compute performance. As a result, we gain a quantitative understanding of optimizations on different subsystems, such as staging and on-node loading of data, compute-unit utilization, and communication scheduling, which together enable overall end-to-end performance improvements of more than 10x through system scaling. Notably, our analysis shows a scale-dependent interplay between the dataset size, a system's memory hierarchy, and training convergence that underlines the importance of near-compute storage. To overcome the data-parallel scalability challenge at large batch sizes, we discuss specific learning techniques and hybrid data-and-model parallelism that are effective on large systems. We conclude by characterizing each benchmark with respect to low-level memory, I/O, and network behavior to parameterize extended roofline performance models in future rounds.
