The explosive growth of AI applications has created unprecedented demand for GPU resources. Cloud providers meet this demand through GPU-as-a-Service platforms that offer rentable GPU resources for running AI workloads. In this context, sharing GPU resources between different tenants is essential to maximize the number of scheduled workloads. Among the various GPU sharing technologies, NVIDIA's Multi-Instance GPU (MIG) stands out by partitioning GPUs at the hardware level into isolated slices with dedicated compute and memory, ensuring strong tenant isolation, preventing resource contention, and enhancing security. Despite these advantages, MIG's fixed partitioning introduces scheduling rigidity, leading to severe GPU fragmentation in multi-tenant environments, where workloads are continuously deployed and terminated. Fragmentation leaves GPUs underutilized, limiting the number of workloads that can be accommodated. To overcome this challenge, we propose a novel scheduling framework for MIG-based clouds that maximizes workload acceptance while mitigating fragmentation in an online, workload-agnostic setting. We introduce a fragmentation metric to quantify resource inefficiency and guide allocation decisions. Building on this metric, our greedy scheduling algorithm selects the GPU and MIG slice that minimize fragmentation growth for each incoming workload. We evaluate our approach against multiple baseline strategies under diverse workload distributions. Results demonstrate that our method consistently achieves higher workload acceptance rates, yielding an average 10% increase in the number of scheduled workloads under heavy load, while using approximately the same number of GPUs as the benchmark methods.