Machine learning (ML) inference serving systems schedule requests to improve GPU utilization while meeting service level objectives (SLOs) or deadlines. However, improving GPU utilization can undermine latency-sensitive scheduling, because concurrently executing tasks contend for GPU resources and thereby interfere with one another. Since interference makes scheduling outcomes unpredictable, neglecting it may compromise SLO or deadline satisfaction. Yet existing interference prediction approaches remain limited in several respects, which restricts their usefulness for scheduling. First, they are often coarse-grained: they ignore runtime co-location dynamics, which limits their prediction accuracy. Second, they tend to rely on a single static prediction model, which may not cope well with varying workload characteristics. Motivated by these limitations, we evaluate existing interference prediction approaches and outline our ongoing work toward efficient ML inference scheduling.