MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems

Steven Farrell, Murali Emani, Jacob Balma, Lukas Drescher, Aleksandr Drozd, Andreas Fink, Geoffrey Fox, David Kanter, Thorsten Kurth, Peter Mattson, Dawei Mu, Amit Ruhela, Kento Sato, Koichi Shirahata, Tsuguchika Tabaru, Aristeidis Tsaris, Jan Balewski, Ben Cumming, Takumi Danjo, Jens Domke, Takaaki Fukai, Naoto Fukumoto, Tatsuya Fukushi, Balazs Gerofi, Takumi Honda, Toshiyuki Imamura, Akihiko Kasagi, Kentaro Kawakami, Shuhei Kudo, Akiyoshi Kuroda, Maxime Martinasso, Satoshi Matsuoka, Henrique Mendonça, Kazuki Minami, Prabhat Ram, Takashi Sawada, Mallikarjun Shankar, Tom St. John, Akihiro Tabuchi, Venkatram Vishwanath, Mohamed Wahib, Masafumi Yamazaki, Junqi Yin

Scientific communities are increasingly adopting machine learning and deep learning models in their applications to accelerate scientific insights. High performance computing systems are pushing the frontiers of performance with a rich diversity of hardware resources and massive scale-out capabilities. There is a critical need to understand fair and effective benchmarking of machine learning applications that are representative of real-world scientific use cases. MLPerf is a community-driven standard to benchmark machine learning workloads, focusing on end-to-end performance metrics. In this paper, we introduce MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications driven by the MLCommons Association. We present the results from the first submission round, including a diverse set of some of the world's largest HPC systems. We develop a systematic framework for their joint analysis and compare them in terms of data staging, algorithmic convergence, and compute performance. As a result, we gain a quantitative understanding of optimizations on different subsystems such as staging and on-node loading of data, compute-unit utilization, and communication scheduling, enabling overall $>10 \times$ (end-to-end) performance improvements through system scaling. Notably, our analysis shows a scale-dependent interplay between the dataset size, a system's memory hierarchy, and training convergence that underlines the importance of near-compute storage. To overcome the data-parallel scalability challenge at large batch sizes, we discuss specific learning techniques and hybrid data-and-model parallelism that are effective on large systems. We conclude by characterizing each benchmark with respect to low-level memory, I/O, and network behavior to parameterize extended roofline performance models in future rounds.
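
The roofline idea mentioned at the end of the abstract can be illustrated with a minimal sketch. This is not the authors' model: it assumes the simplest possible extension of the classic roofline, in which attainable throughput is the minimum of the compute peak and one bandwidth-times-intensity ceiling per subsystem (memory, I/O, network). The function name `attainable_flops`, the subsystem labels, and every number in the example are hypothetical and chosen only for illustration.

```python
def attainable_flops(peak_flops, bandwidths, intensities):
    """Return (attainable FLOP/s, limiting subsystem) under a simple extended roofline.

    peak_flops  -- compute peak in FLOP/s
    bandwidths  -- dict: subsystem name -> bandwidth in bytes/s
    intensities -- dict: subsystem name -> FLOPs performed per byte moved
                   through that subsystem (same keys as `bandwidths`)
    """
    # The compute peak is one ceiling; each subsystem adds another.
    ceilings = {"compute": peak_flops}
    for name, bandwidth in bandwidths.items():
        ceilings[name] = bandwidth * intensities[name]

    # The lowest ceiling bounds the attainable performance.
    limiter = min(ceilings, key=ceilings.get)
    return ceilings[limiter], limiter


# Hypothetical node and workload parameters (assumptions, not measured values).
example_bandwidths = {"memory": 1.0e12, "io": 1.0e10, "network": 2.5e10}
example_intensities = {"memory": 50.0, "io": 2000.0, "network": 1000.0}

example_perf, example_limiter = attainable_flops(
    peak_flops=1.0e14,
    bandwidths=example_bandwidths,
    intensities=example_intensities,
)
print(f"{example_perf:.2e} FLOP/s, limited by {example_limiter}")
```

With these made-up parameters the I/O ceiling is the lowest, which mirrors the abstract's qualitative point that data staging and near-compute storage can bound training throughput before compute does.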
