The growing demand for computational resources in machine learning has made efficient resource allocation a critical challenge, especially in heterogeneous hardware clusters where devices vary in capability, age, and energy efficiency. Upgrading to the latest hardware is often infeasible, making sustainable use of existing, mixed-generation resources essential. In this paper, we propose a learning-based architecture for managing machine learning workloads in heterogeneous clusters. The system operates online, allocating resources to incoming training or inference requests while minimizing energy consumption and meeting performance requirements. It uses two neural networks. The first provides initial estimates of how well a new model will utilize each hardware type and how it will affect co-located models; an optimizer then allocates resources based on these estimates. After deployment, the system monitors actual performance and feeds this data to the second network, which refines the predictions. The refined model improves estimates not only for the hardware currently in use but also for hardware not initially allocated and for co-location scenarios not yet observed. The result is an adaptive, iterative approach that learns over time to make more effective resource allocation decisions in heterogeneous deep learning clusters.
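To make the estimate-allocate-monitor-refine loop concrete, the following is a minimal Python sketch of the control flow the abstract describes. All names here (`InitialEstimator`, `RefinementNet`, `choose_allocation`, the hardware labels, and the placeholder numbers) are hypothetical illustrations under our own assumptions, not the paper's actual components; the two "networks" are stubbed with simple heuristics in place of trained models.

```python
# Hypothetical sketch of the two-network allocation loop from the abstract.
# InitialEstimator stands in for the first neural network, RefinementNet for
# the second; neither reflects the paper's actual architecture or API.
from dataclasses import dataclass

@dataclass
class Request:
    model_id: str
    kind: str            # "training" or "inference"
    perf_target: float   # required throughput (samples/s), assumed metric

@dataclass
class Estimate:
    throughput: float    # predicted samples/s on this hardware
    energy: float        # predicted energy per sample (arbitrary units)
    interference: float  # predicted slowdown factor from co-location

class InitialEstimator:
    """Stand-in for network 1: maps (model, hardware, co-location state)
    to utilization and interference estimates."""
    def predict(self, req: Request, hw: str, colocated: list) -> Estimate:
        # Placeholder heuristic in lieu of a trained network.
        base = {"gpu_old": 900.0, "gpu_new": 1800.0, "cpu": 120.0}[hw]
        slowdown = 1.0 + 0.1 * len(colocated)
        return Estimate(throughput=base / slowdown,
                        energy=250.0 / base * slowdown,
                        interference=slowdown)

class RefinementNet:
    """Stand-in for network 2: corrects estimates from observed performance,
    so later predictions improve even for unseen hardware/co-locations."""
    def __init__(self):
        self.correction = {}  # (model_id, hw) -> multiplicative correction

    def update(self, req, hw, predicted, observed_throughput):
        self.correction[(req.model_id, hw)] = observed_throughput / predicted.throughput

    def refine(self, req, hw, est: Estimate) -> Estimate:
        c = self.correction.get((req.model_id, hw), 1.0)
        return Estimate(est.throughput * c, est.energy / c, est.interference)

def choose_allocation(req, cluster, estimator, refiner):
    """Optimizer: pick the feasible hardware with the lowest predicted
    energy that still meets the request's performance target."""
    feasible = []
    for hw, colocated in cluster.items():
        est = refiner.refine(req, hw, estimator.predict(req, hw, colocated))
        if est.throughput >= req.perf_target:
            feasible.append((est.energy, hw, est))
    return min(feasible) if feasible else None

# Online loop: estimate -> allocate -> deploy -> monitor -> refine.
cluster = {"gpu_old": [], "gpu_new": [], "cpu": []}
estimator, refiner = InitialEstimator(), RefinementNet()
for req in [Request("resnet50", "training", 800.0),
            Request("bert", "inference", 500.0)]:
    choice = choose_allocation(req, cluster, estimator, refiner)
    if choice is None:
        continue                                # no placement meets the target
    energy, hw, est = choice
    cluster[hw].append(req.model_id)            # deploy on chosen hardware
    observed = est.throughput * 0.9             # monitored performance (stub)
    refiner.update(req, hw, est, observed)      # feed measurements back
    print(f"{req.model_id} -> {hw}: predicted {est.throughput:.0f}, observed {observed:.0f}")
```

The key design point the sketch mirrors is that observed measurements flow back only into the refinement stage, so the initial estimator can remain a general prior while corrections accumulate per model and hardware pair.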