Machine learning (ML) tasks are among the major workloads in today's edge computing networks. Existing edge-cloud schedulers allocate the requested amount of resources to each task and therefore fall short of making the best use of the limited edge resources for ML tasks. This paper proposes TapFinger, a distributed scheduler for edge clusters that minimizes the total completion time of ML tasks by co-optimizing task placement and fine-grained multi-resource allocation. To learn the tasks' uncertain resource sensitivity and enable distributed scheduling, we adopt multi-agent reinforcement learning (MARL) and propose several techniques to make it efficient, including a heterogeneous graph attention network as the MARL backbone, a tailored task selection phase in the actor network, and the integration of Bayes' theorem and masking schemes. We first implement a single-task scheduling version, which schedules at most one task at a time. We then generalize to the multi-task scheduling case, in which a sequence of tasks is scheduled simultaneously. Our design can mitigate the expanded decision space and yield fast convergence to optimal scheduling solutions. Extensive experiments using synthetic and testbed ML task traces show that TapFinger can achieve up to a 54.9% reduction in the average task completion time and improve resource efficiency compared with state-of-the-art schedulers.
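The abstract mentions masking schemes in the actor network without giving details. As a rough illustration only, the sketch below shows one common form of action masking: placements on edge nodes that cannot satisfy a task's resource demand receive zero probability before an action is sampled. This is not TapFinger's implementation; the function names (`feasibility_mask`, `masked_softmax`), the resource keys, and all numbers are hypothetical.

```python
import math


def feasibility_mask(free, demand):
    """Return one boolean per node: True if the node can host the task.

    free   -- list of dicts mapping resource name to free amount (hypothetical)
    demand -- dict mapping resource name to the amount the task needs
    """
    return [
        all(node.get(res, 0) >= need for res, need in demand.items())
        for node in free
    ]


def masked_softmax(logits, mask):
    """Softmax over logits; entries whose mask is False get probability 0."""
    if not any(mask):
        raise ValueError("at least one action must be valid")
    # Subtract the largest valid logit for numerical stability.
    peak = max(l for l, ok in zip(logits, mask) if ok)
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, mask)]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative values only: three edge nodes and one task needing 2 CPUs, 1 GPU.
free = [{"cpu": 4, "gpu": 0}, {"cpu": 2, "gpu": 1}, {"cpu": 8, "gpu": 2}]
demand = {"cpu": 2, "gpu": 1}

mask = feasibility_mask(free, demand)          # node 0 has no free GPU
probs = masked_softmax([1.0, 0.5, 0.5], mask)  # logits stand in for actor output
```

Masking of this kind shrinks the effective decision space, since the policy never spends probability mass on infeasible placements; how TapFinger combines its masks with Bayes' theorem is not specified in the abstract.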