Many HPC applications are bottlenecked by shared caches, instruction execution units, I/O, or memory bandwidth, even though their remaining resources may be underutilized. Because it is hard for developers and runtime systems to ensure that a single application fully exploits all critical resources, an attractive technique for increasing HPC system utilization is to colocate multiple applications on the same server. When colocated applications share critical resources, however, contention may reduce application performance. In this paper, we show that server efficiency can be improved by first modeling the expected performance degradation of colocated applications based on measured hardware performance counters, and then exploiting the model to determine an optimized mix of colocated applications. This paper presents a new intelligent resource manager and makes the following contributions: (1) a new machine learning model that predicts the performance degradation of colocated applications from hardware counters, and (2) an intelligent scheduling scheme, deployed on an existing resource manager, that enables application co-scheduling with minimum performance degradation. Our results show that our approach achieves performance improvements of 7% on average and up to 12% compared to the standard policy commonly used by existing job managers.
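To make the two-step idea concrete (predict degradation from counters, then choose the colocation mix), the following is a minimal sketch under stated assumptions, not the model or scheduler from this paper. The application names, the counter profiles, and the scoring function `predicted_degradation` are all hypothetical: a simple product of resource pressures stands in for a trained machine learning model, and an exhaustive search over pairings stands in for the scheduling scheme.

```python
# Hypothetical, normalized hardware-counter profiles per application.
# These names and values are illustrative only; they are not from the paper.
PROFILES = {
    "cfd": {"llc_miss_rate": 0.8, "mem_bw": 0.9},
    "fft": {"llc_miss_rate": 0.7, "mem_bw": 0.8},
    "md":  {"llc_miss_rate": 0.2, "mem_bw": 0.1},
    "mc":  {"llc_miss_rate": 0.1, "mem_bw": 0.3},
}


def predicted_degradation(a, b):
    """Stand-in for a trained model (assumption).

    Scores a pair higher when both applications press on the same
    resource, by summing the products of their per-counter pressures.
    """
    return sum(a[key] * b[key] for key in a)


def best_pairing(profiles):
    """Return (pairs, total_cost) minimizing the summed predicted degradation.

    Exhaustive search over all perfect pairings; only practical for a
    small, even number of applications.
    """
    names = sorted(profiles)
    if len(names) % 2:
        raise ValueError("need an even number of applications to pair")
    if not names:
        return [], 0.0

    first, rest = names[0], names[1:]
    best = None
    for partner in rest:
        # Pair `first` with `partner`, then solve the remaining set recursively.
        remaining = {n: profiles[n] for n in rest if n != partner}
        pairs, cost = best_pairing(remaining)
        cost += predicted_degradation(profiles[first], profiles[partner])
        if best is None or cost < best[1]:
            best = ([(first, partner)] + pairs, cost)
    return best


if __name__ == "__main__":
    pairs, cost = best_pairing(PROFILES)
    print(pairs, round(cost, 2))
```

With these illustrative profiles, the search places each memory-intensive application with a lighter one instead of colocating the two memory-intensive applications, which is the intuition behind contention-aware co-scheduling. A production scheduler would replace the scoring function with a model trained on measured counters and would need a search strategy that scales beyond a handful of jobs.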