Distributed computing enables large-scale computation tasks to be processed over multiple workers in parallel. However, the randomness of communication and computation delays across workers causes the straggler effect, which may degrade the performance. Coded computation helps to mitigate the straggler effect, but the amount of redundant load and their assignment to the workers should be carefully optimized. In this work, we consider a multi-master heterogeneous-worker distributed computing scenario, where multiple matrix multiplication tasks are encoded and allocated to workers for parallel computation. The goal is to minimize the communication plus computation delay of the slowest task. We propose worker assignment, resource allocation and load allocation algorithms under both dedicated and fractional worker assignment policies, where each worker can process the encoded tasks of either a single master or multiple masters, respectively. Then, the non-convex delay minimization problem is solved by employing the Markov's inequality-based approximation, Karush-Kuhn-Tucker conditions, and successive convex approximation methods. Through extensive simulations, we show that the proposed algorithms can reduce the task completion delay compared to the benchmarks, and observe that dedicated and fractional worker assignment policies have different scopes of applications.
翻译:分布式计算可以同时处理多个工人的大规模计算任务。 然而, 沟通和计算延误的随机性导致工人的分流效应, 这会降低业绩。 编码计算有助于减轻分流效应, 但冗余负荷量和分配给工人的量应谨慎优化。 在此工作中, 我们考虑多种技术主管的多种工作分布计算假设, 将多重矩阵的倍增任务编码并分配给工人进行平行计算。 目标是最大限度地减少通信与计算最慢任务之间的延迟。 我们根据专门的和分数的工人任务分配政策提出工人分配、 资源分配和工作量分配算法, 让每个工人分别处理单个主人或多个主人的编码任务。 然后, 通过使用Markov的基于不平等的近距离、 Karush- Kuhn- Tucker 条件和连续的convex 近似方法, 解决了非convex 的延迟问题。 我们通过广泛的模拟, 显示拟议的算法可以比基准减少任务完成时间的延迟, 并观察专门的和分数工人应用范围。