Performance tools for forthcoming heterogeneous exascale platforms must address two principal challenges when analyzing execution measurements. First, measuring extreme-scale executions generates large volumes of performance data. Second, performance metrics for heterogeneous applications are highly sparse across code regions. To address these challenges, we developed a novel "streaming aggregation" approach to post-mortem analysis that employs both shared-memory and distributed-memory parallelism to aggregate sparse performance measurements from every rank, thread, and GPU stream of a large-scale application execution. Analysis results are stored in a pair of sparse formats designed for efficient access to related data elements, supporting responsive interactive presentation and scalable data analytics. Empirical analysis shows that our implementation of this approach in HPCToolkit effectively processes measurement data from thousands of threads using a fraction of the compute resources employed by the application itself. Our approach performs analysis up to 9.4 times faster and stores analysis results that are 23 times smaller than those of HPCToolkit's prior analysis, providing a key building block for scalable exascale performance tools.
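To make the idea of aggregating sparse measurements concrete, the following is a minimal single-process sketch. It assumes each profile (one per thread or GPU stream) is a list of `(context_id, metric_id, value)` triples that holds only nonzero entries. The function names and the CSR-like layout are hypothetical illustrations, not HPCToolkit's actual formats or API, and the sketch omits the shared-memory and distributed-memory parallelism described above.

```python
from collections import defaultdict


def aggregate_stream(profiles):
    """Fold sparse profiles into running statistics, one profile at a time.

    Only the accumulators are kept in memory, never all profiles at once.
    Returns {(context_id, metric_id): [sum, nonzero_count]}.
    """
    stats = defaultdict(lambda: [0.0, 0])
    for profile in profiles:  # profiles may be a lazy iterator
        for ctx, metric, value in profile:
            if value != 0:  # sparse input: skip explicit zeros
                acc = stats[(ctx, metric)]
                acc[0] += value
                acc[1] += 1
    return dict(stats)


def to_context_major(stats):
    """Pack the aggregated sums into a CSR-like, context-major layout.

    Row i covers entries[offsets[i]:offsets[i + 1]] for contexts[i];
    each entry is a (metric_id, sum) pair.
    """
    contexts, offsets, entries = [], [], []
    for ctx, metric in sorted(stats):
        if not contexts or contexts[-1] != ctx:
            contexts.append(ctx)
            offsets.append(len(entries))  # start of a new context row
        entries.append((metric, stats[(ctx, metric)][0]))
    offsets.append(len(entries))  # end sentinel
    return contexts, offsets, entries


# Hypothetical measurements: two CPU threads and one GPU stream.
profiles = [
    [(0, 0, 1.5), (2, 1, 4.0)],
    [(0, 0, 0.5)],
    [(2, 1, 6.0), (3, 2, 2.0)],
]
stats = aggregate_stream(iter(profiles))
contexts, offsets, entries = to_context_major(stats)
```

In this layout, all metrics for a given calling context sit next to each other, so a viewer can fetch one context's row with a single slice. A second, metric-major copy of the same data would serve queries that scan one metric across all contexts, which is one plausible reading of why a pair of sparse formats is useful.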