Jobs on high-performance computing (HPC) clusters can suffer significant performance degradation due to inter-job network interference. The topology-aware job allocation problem (TJAP) is the problem of deciding how to dedicate nodes to specific applications so as to mitigate this interference. In this paper, we study the window-based TJAP on a fat-tree network, with the aim of minimizing the communication hop cost, a metric we define to quantify inter-job interference. The window-based scheduling approach periodically takes the jobs waiting in the queue and solves an assignment problem that maps them to the available nodes. Two special allocation strategies are considered: the static continuity assignment strategy (SCAS) and the dynamic continuity assignment strategy (DCAS). For the SCAS, a 0-1 integer programming model is developed. For the DCAS, we propose an approach called the neural simulated algorithm (NSA), an extension of the simulated algorithm (SA) that learns a repair operator and employs it in a guided heuristic search. The efficacy of NSA is demonstrated in a computational study against SA and SCIP. The results of the numerical experiments indicate that both the model and the algorithm proposed in this paper are effective.
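The window-based loop described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's formulation: the helper names (`hop_count`, `job_hop_cost`, `assign_window`, `run_windows`), the two-level tree with `NODES_PER_LEAF` nodes per leaf switch, the pairwise hop metric (2 hops under the same leaf switch, 4 hops otherwise), and the greedy contiguous assignment, which merely stands in for the 0-1 integer program or NSA.

```python
from collections import deque
from itertools import combinations

NODES_PER_LEAF = 4  # assumption: nodes attached to each leaf switch


def hop_count(a, b):
    """Hops between two nodes in a two-level tree (illustrative metric)."""
    if a == b:
        return 0
    # Same leaf switch: node -> leaf -> node. Otherwise go via an upper switch.
    return 2 if a // NODES_PER_LEAF == b // NODES_PER_LEAF else 4


def job_hop_cost(nodes):
    """Sum of pairwise hop counts over the nodes given to one job."""
    return sum(hop_count(a, b) for a, b in combinations(nodes, 2))


def assign_window(jobs, free_nodes):
    """Greedy placeholder for the per-window assignment problem.

    jobs: list of (job_id, nodes_requested); free_nodes: sorted node ids.
    Returns (allocation dict, jobs left waiting, remaining free nodes).
    """
    allocation, waiting = {}, []
    free = list(free_nodes)
    for job_id, size in jobs:
        if size <= len(free):
            # Give the job the next contiguous run of free nodes.
            allocation[job_id], free = free[:size], free[size:]
        else:
            waiting.append((job_id, size))
    return allocation, waiting, free


def run_windows(queue, free_nodes, window_size):
    """Repeatedly take up to window_size jobs and solve one assignment."""
    queue = deque(queue)
    free = sorted(free_nodes)
    schedule = []
    while queue:
        window = [queue.popleft() for _ in range(min(window_size, len(queue)))]
        allocation, waiting, free = assign_window(window, free)
        # Jobs that did not fit return to the head of the queue, in order.
        queue.extendleft(reversed(waiting))
        if not allocation:
            # Nothing fits; this sketch never releases nodes, so stop here.
            break
        schedule.append(allocation)
    return schedule, list(queue)


if __name__ == "__main__":
    schedule, left = run_windows([("a", 3), ("b", 2), ("c", 4)], range(8), 2)
    for window in schedule:
        for job_id, nodes in window.items():
            print(job_id, nodes, job_hop_cost(nodes))
    print("still queued:", left)
```

In this sketch, a job whose nodes all sit under one leaf switch gets a lower cost than a job split across two leaf switches, which is the kind of difference a topology-aware assignment would try to exploit.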