This paper studies the utility of data analytics and machine learning techniques for identifying, classifying, and characterizing the dynamics of large-scale parallel (MPI) programs. To this end, we run microbenchmarks and realistic proxy applications with a regular compute-communicate structure on two different supercomputing platforms and choose the per-process performance and MPI time per time step as the relevant observables. Using principal component analysis, clustering techniques, correlation functions, and a new "phase space plot," we show how desynchronization patterns (or the lack thereof) can be readily identified from a data set that is much smaller than a full MPI trace. Our methods also point the way toward a more general classification of parallel program dynamics.
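To make the analysis pipeline more concrete, the following is a minimal sketch of how principal component analysis might be applied to a matrix of per-process MPI times, with one row per process and one column per time step. It is an illustration under stated assumptions, not the method used in the paper: the data are synthetic, the function name `pca_scores` is hypothetical, and a simple sign threshold on the first principal component stands in for a real clustering technique.

```python
import numpy as np

def pca_scores(X, n_components=2):
    """Project rows of X onto the leading principal components.

    X has one row per MPI process and one column per time step
    (an assumed layout, chosen for this illustration).
    """
    Xc = X - X.mean(axis=0)                      # center each time step across processes
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)        # fraction of variance per component
    return scores, explained[:n_components]

# Synthetic stand-in data (NOT measurements from the paper):
# half of the processes show a flat MPI time, the other half an oscillating one.
rng = np.random.default_rng(0)
n_procs, n_steps = 64, 200
t = np.arange(n_steps)
flat = 1.0 + 0.01 * rng.standard_normal((n_procs // 2, n_steps))
wavy = (1.0 + 0.2 * np.sin(2 * np.pi * t / 50)
        + 0.01 * rng.standard_normal((n_procs // 2, n_steps)))
X = np.vstack([flat, wavy])

scores, explained = pca_scores(X)

# Crude two-group split on the sign of the first component;
# a proper clustering algorithm would replace this step.
labels = (scores[:, 0] > 0).astype(int)
```

In this toy setting the first principal component separates the two groups of processes, which mirrors the idea that a small per-process, per-time-step data set can already expose distinct behavioral patterns without a full MPI trace.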