High-performance computing (HPC) systems frequently experience congestion, which leads to significant variation in application performance. However, the impact of congestion on runtime differs from application to application, depending on each application's network characteristics (such as bandwidth and latency requirements). We leverage this insight to develop Netscope, an automated ML-driven framework that considers those network characteristics to dynamically mitigate congestion. We evaluate Netscope on four Cray Aries systems, including a production supercomputer, using real scientific applications. Netscope has a lower training cost and accurately estimates the impact of congestion on application runtime, with a correlation between 0.7 and 0.9 for common scientific applications. Moreover, we find that Netscope reduces tail runtime variability by up to 14.9 times while improving median system utility by 12%.