We provide a queueing-theoretic framework for job replication schemes based on the principle "\emph{replicate a job as soon as the system detects it as a \emph{straggler}}". This is called job \emph{speculation}. Recent works have analyzed replication on arrival, which we refer to as \emph{replication}. Replication is motivated by its implementation in Google's BigTable. However, systems such as Apache Spark and Hadoop MapReduce implement speculative job execution. The performance and optimization of speculative job execution are not well understood. To this end, we propose a queueing network model for load balancing where each server can speculate on the execution time of a job. Specifically, each job is initially assigned to a single server by a frontend dispatcher. Then, when its execution begins, the server sets a timeout. If the job completes before the timeout, it leaves the network; otherwise, the job is terminated and relaunched or resumed at another server, where it will complete. We provide a necessary and sufficient condition for the stability of speculative queueing networks with heterogeneous servers, general job sizes, and general scheduling disciplines. We find that speculation can increase the stability region of the network when compared with standard load balancing models and replication schemes. We provide general conditions under which timeouts increase the size of the stability region and derive a formula for the optimal speculation time, i.e., the timeout that minimizes the load induced through speculation. We compare speculation with redundant-$d$ and redundant-to-idle-queue-$d$ rules under an $S\& X$ model. For lightly loaded systems, redundancy schemes provide better response times. However, for moderate to heavy loads, redundancy schemes can lose capacity and have markedly worse response times when compared with a speculative scheme.
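The timeout mechanism described above can be illustrated with a small calculation. The following is a minimal sketch, not the model analyzed in the paper: it assumes that a job's run time at a server is the product of an intrinsic size $X$ and an independent per-server slowdown $S$ (one common reading of an $S\& X$ model), that both take finitely many values, and that a timed-out job is relaunched from scratch at a fresh server where it runs to completion. The function name `expected_work` and the example distributions are hypothetical.

```python
from fractions import Fraction
from itertools import product


def expected_work(size_dist, slowdown_dist, timeout=None):
    """Expected total server time consumed per job.

    size_dist, slowdown_dist: dicts mapping a value to its probability.
    timeout: None disables speculation; otherwise a job whose first run
    would exceed `timeout` is killed at `timeout` and relaunched from
    scratch at a second server (assumption: the rerun is never timed out).
    """
    total = Fraction(0)
    for (x, px), (s1, p1) in product(size_dist.items(), slowdown_dist.items()):
        first_run = x * s1  # run time at the first server (assumed S * X)
        if timeout is None or first_run <= timeout:
            total += px * p1 * first_run
        else:
            # Work wasted up to the timeout, plus the expected rerun time
            # of the same job at a server with a fresh slowdown.
            rerun = sum(p2 * x * s2 for s2, p2 in slowdown_dist.items())
            total += px * p1 * (timeout + rerun)
    return total


# Hypothetical example: unit-size jobs, 10% chance that a server is 10x slow.
sizes = {1: Fraction(1)}
slowdowns = {1: Fraction(9, 10), 10: Fraction(1, 10)}

no_speculation = expected_work(sizes, slowdowns)           # 19/10
with_timeout = expected_work(sizes, slowdowns, timeout=1)  # 119/100
```

In this toy instance, a timeout of 1 lowers the expected work per job from 19/10 to 119/100, because stragglers are cut off early and most reruns land on a fast server. If all servers had the same slowdown, the same rule could only add wasted work, which is consistent with the abstract's point that timeouts help only under certain conditions.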