Distributed cloud environments hosting data-intensive applications often experience slowdowns due to network congestion, asymmetric bandwidth, and inter-node data shuffling. Traditional host-level metrics such as CPU and memory utilization typically do not capture these factors. Scheduling without accounting for them can lead to poor placement decisions, longer data transfers, and suboptimal job performance. We present a network-aware job scheduler that uses supervised learning to predict the completion time of candidate jobs. Our system introduces a prediction-and-ranking mechanism that collects real-time telemetry from all nodes, uses a trained supervised model to estimate the job's duration on each node, and ranks the nodes by estimated duration to select the best placement. We evaluate the scheduler on a geo-distributed Kubernetes cluster deployed on the FABRIC testbed by running network-intensive Spark workloads. Compared to the default Kubernetes scheduler, which makes placement decisions based on current resource availability alone, our supervised scheduler achieved 34-54% higher accuracy in selecting optimal nodes for job placement. The novelty of our work lies in demonstrating supervised learning for real-time, network-aware job scheduling on a multi-site cluster.
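The prediction-and-ranking mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation: the telemetry fields (`cpu_util`, `mem_util`, `rtt_ms`, `bandwidth_mbps`), the node names, and `toy_predictor` with its hand-picked weights are all hypothetical stand-ins. In the actual system a trained supervised model would take the place of `toy_predictor`, and the telemetry would come from the live cluster.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class NodeTelemetry:
    """Hypothetical per-node snapshot; the feature names are illustrative."""
    cpu_util: float        # fraction of CPU in use, 0.0-1.0
    mem_util: float        # fraction of memory in use, 0.0-1.0
    rtt_ms: float          # round-trip time to the data source, in ms
    bandwidth_mbps: float  # available bandwidth toward the data source


# A predictor maps one node's telemetry to an estimated job duration (seconds).
Predictor = Callable[[NodeTelemetry], float]


def toy_predictor(t: NodeTelemetry) -> float:
    """Stand-in for a trained supervised model; every weight here is made up."""
    transfer_s = 8000.0 / t.bandwidth_mbps          # assumes ~8000 Mb of data moved
    compute_s = 30.0 / max(1.0 - t.cpu_util, 0.05)  # busier nodes compute slower
    return transfer_s + compute_s + 0.1 * t.rtt_ms + 10.0 * t.mem_util


def rank_nodes(
    telemetry: Dict[str, NodeTelemetry], predict: Predictor
) -> List[Tuple[str, float]]:
    """Score every node, then sort by predicted duration, shortest first."""
    scored = [(name, predict(t)) for name, t in telemetry.items()]
    return sorted(scored, key=lambda pair: pair[1])


def select_node(telemetry: Dict[str, NodeTelemetry], predict: Predictor) -> str:
    """Return the node with the shortest predicted job duration."""
    if not telemetry:
        raise ValueError("no candidate nodes")
    return rank_nodes(telemetry, predict)[0][0]


# Made-up snapshot of a three-site cluster.
cluster = {
    "site-a": NodeTelemetry(cpu_util=0.05, mem_util=0.10, rtt_ms=80.0, bandwidth_mbps=100.0),
    "site-b": NodeTelemetry(cpu_util=0.60, mem_util=0.50, rtt_ms=5.0, bandwidth_mbps=1000.0),
    "site-c": NodeTelemetry(cpu_util=0.10, mem_util=0.20, rtt_ms=40.0, bandwidth_mbps=400.0),
}

if __name__ == "__main__":
    for name, seconds in rank_nodes(cluster, toy_predictor):
        print(f"{name}: {seconds:.1f} s predicted")
    print("selected:", select_node(cluster, toy_predictor))
```

With these invented values the ranking is site-c, site-b, site-a. Note that site-a is the least loaded host by CPU and memory, yet it ranks last because of its low bandwidth and high round-trip time, which is the kind of case a purely host-level placement policy would miss.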