Balancing Fairness and Performance in Multi-User Spark Workloads with Dynamic Scheduling (extended version)

From arXiv. This paper is an extended version of a paper accepted at the ACM Symposium on Cloud Computing (SoCC'25); the extended version contains a proof of correctness.

Apache Spark is a widely adopted framework for large-scale data processing. However, in industrial analytics environments, Spark's built-in schedulers, such as FIFO and fair scheduling, struggle to maintain both user-level fairness and low mean response time, particularly in long-running shared applications. Existing solutions typically focus on job-level fairness, which unintentionally favors users who submit more jobs. Although Spark offers a built-in fair scheduler, it does not adapt to dynamic user workloads and may degrade overall job performance. We present the User Weighted Fair Queuing (UWFQ) scheduler, designed to minimize job response times while ensuring equitable resource distribution across users and their respective jobs. UWFQ simulates a virtual fair queuing system and schedules jobs based on their estimated finish times under a bounded fairness model. To further address task skew and reduce priority inversions, both of which are common in Spark workloads, we introduce runtime partitioning, a method that dynamically refines task granularity based on expected runtime. We implement UWFQ within the Spark framework and evaluate its performance using multi-user synthetic workloads and Google cluster traces. We show that UWFQ reduces the average response time of small jobs by up to 74% compared to both Spark's built-in schedulers and state-of-the-art fair scheduling algorithms.
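To make the idea of scheduling by estimated finish time in a simulated fair queue more concrete, here is a minimal Python sketch. It is not the UWFQ algorithm from the paper: it uses a textbook-style weighted fair queuing finish tag, assumes all jobs arrive at the same instant (so virtual time at arrival is 0), and runs each user's jobs in submission order. The names `Job`, `order_by_virtual_finish`, `split_count`, and the `target_task_runtime` parameter are hypothetical and introduced only for illustration.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Job:
    user: str
    name: str
    est_runtime: float  # estimated total work in seconds (assumed to be known)


def order_by_virtual_finish(jobs, weights=None):
    """Order jobs by a per-user virtual finish tag (illustrative only).

    Assumption: all jobs arrive together, so each user's first job starts
    at virtual time 0 and later jobs queue behind that user's earlier ones.
    """
    weights = weights or {}
    last_finish = defaultdict(float)  # latest finish tag seen per user
    tagged = []
    for seq, job in enumerate(jobs):
        weight = weights.get(job.user, 1.0)
        # A user's extra jobs extend that user's own tag, so submitting
        # more jobs does not earn a larger share of the cluster.
        finish = last_finish[job.user] + job.est_runtime / weight
        last_finish[job.user] = finish
        tagged.append((finish, seq, job))
    # Smallest finish tag first; submission order breaks ties.
    tagged.sort(key=lambda t: (t[0], t[1]))
    return [job for _, _, job in tagged]


def split_count(expected_runtime, target_task_runtime):
    """Hypothetical rule for choosing task granularity: pick enough tasks
    that each one is expected to run for about target_task_runtime."""
    return max(1, math.ceil(expected_runtime / target_task_runtime))


jobs = [
    Job("alice", "a1", 10.0),
    Job("alice", "a2", 10.0),
    Job("alice", "a3", 10.0),
    Job("bob", "b1", 5.0),
]
print([j.name for j in order_by_virtual_finish(jobs)])
```

In this example, bob's single short job is ordered ahead of alice's three jobs even though it was submitted last, which is the user-level behavior that job-level fairness does not provide. The `split_count` helper only hints at how expected runtime could drive task granularity; the paper's runtime partitioning method may differ substantially.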
