This paper presents current challenges in designing deep learning artificial intelligence (AI) and integrating it with traditional high-performance computing (HPC) simulations. We evaluate existing packages for their ability to run deep learning models and applications efficiently on large-scale HPC systems, identify challenges, and propose new asynchronous parallelization and optimization techniques for current large-scale heterogeneous systems and upcoming exascale systems. These developments, along with existing HPC AI software capabilities, have been integrated into MagmaDNN, an open-source HPC deep learning framework. Many deep learning frameworks are targeted at data scientists and fall short in providing quality integration into existing HPC workflows. This paper discusses the requirements of an HPC deep learning framework and how those requirements can be met (e.g., as in MagmaDNN) through deep integration with existing HPC libraries, such as MAGMA and its modular memory management, MPI, cuBLAS, cuDNN, MKL, and HIP. Advances are further illustrated through algorithmic enhancements in reduced- and mixed-precision arithmetic, as well as asynchronous optimization methods. Finally, we present illustrations and potential solutions for enhancing traditional compute- and data-intensive applications at ORNL and UTK with AI. The approaches and future challenges are illustrated in materials science, imaging, and climate applications.