High-performance computing (HPC) workloads are becoming increasingly diverse, with wide variability in job characteristics, yet cluster scheduling has long relied on static, heuristic-based policies. In this work, we present SchedTwin, a real-time digital twin designed to adaptively guide scheduling decisions using predictive simulation. SchedTwin periodically ingests runtime events from the physical scheduler, performs rapid what-if evaluations of multiple policies using a high-fidelity discrete-event simulator, and dynamically selects the policy that best satisfies the administrator-configured optimization goal. We implement SchedTwin as open-source software and integrate it with the production PBS scheduler. Preliminary results show that SchedTwin consistently outperforms widely used static scheduling policies while maintaining low overhead (a few seconds per scheduling cycle). These results demonstrate that real-time digital twins offer a practical and effective path toward adaptive HPC scheduling.
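To make the what-if loop described in the abstract concrete, the following is a minimal sketch of one scheduling cycle: take a snapshot of the queue and running jobs, replay it under several candidate policies, and pick the policy with the best score. This is not SchedTwin's implementation and does not use any PBS interface. The names `Job`, `POLICIES`, `simulate`, and `select_policy` are hypothetical, the simulator is a deliberately simple stand-in (strict priority order, no backfilling) for a high-fidelity discrete-event simulator, and mean wait time is assumed as the administrator's optimization goal.

```python
import heapq
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Job:
    job_id: str
    submit: float   # submission time, in seconds
    nodes: int      # number of nodes requested
    runtime: float  # estimated runtime, in seconds


# Hypothetical candidate policies: each maps a job to a sort key
# (jobs with smaller keys are started first).
POLICIES: Dict[str, Callable[[Job], float]] = {
    "FCFS": lambda j: j.submit,    # first come, first served
    "SJF": lambda j: j.runtime,    # shortest job first
    "LJF": lambda j: -j.runtime,   # longest job first
}


def simulate(queue: List[Job], running: List[Tuple[float, int]],
             total_nodes: int, now: float,
             key: Callable[[Job], float]) -> float:
    """Toy what-if run: start queued jobs in strict `key` order (no
    backfilling) and return the mean wait time in seconds.

    `running` holds (end_time, nodes) pairs for jobs already executing.
    """
    if any(j.nodes > total_nodes for j in queue):
        raise ValueError("a job requests more nodes than the cluster has")
    free = total_nodes - sum(n for _, n in running)
    ends = list(running)          # min-heap of (end_time, nodes)
    heapq.heapify(ends)
    clock = now
    total_wait = 0.0
    for job in sorted(queue, key=key):
        # Advance simulated time until enough nodes are released.
        while free < job.nodes:
            end, n = heapq.heappop(ends)
            clock = max(clock, end)
            free += n
        free -= job.nodes
        heapq.heappush(ends, (clock + job.runtime, job.nodes))
        total_wait += clock - job.submit
    return total_wait / len(queue) if queue else 0.0


def select_policy(queue: List[Job], running: List[Tuple[float, int]],
                  total_nodes: int, now: float) -> Tuple[str, Dict[str, float]]:
    """Evaluate every candidate policy on the same snapshot and return
    the name of the one with the lowest mean wait, plus all scores."""
    scores = {name: simulate(queue, running, total_nodes, now, key)
              for name, key in POLICIES.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Example snapshot: a 4-node cluster that is fully busy until t = 150 s.
queue = [Job("A", 0.0, 4, 1000.0), Job("B", 10.0, 2, 50.0), Job("C", 20.0, 2, 60.0)]
best, scores = select_policy(queue, running=[(150.0, 4)], total_nodes=4, now=100.0)
print(best, scores)
```

In this snapshot, shortest-job-first gives the lowest mean wait because the two small jobs can share the cluster before the large job starts. A real digital twin would call `select_policy` once per scheduling cycle on a fresh snapshot, so the chosen policy can change as the workload mix changes.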