Coded computation techniques provide robustness against straggling workers in distributed computing. However, most of the existing schemes require exact provisioning of the straggling behaviour and ignore the computations carried out by straggling workers. Moreover, these schemes are typically designed to recover the desired computation results accurately, while in many machine learning and iterative optimization algorithms, faster approximate solutions are known to result in an improvement in the overall convergence time. In this paper, we first introduce a novel coded matrix-vector multiplication scheme, called coded computation with partial recovery (CCPR), which benefits from the advantages of both coded and uncoded computation schemes, and reduces both the computation time and the decoding complexity by allowing a trade-off between the accuracy and the speed of computation. We then extend this approach to distributed implementation of more general computation tasks by proposing a coded communication scheme with partial recovery, where the results of subtasks computed by the workers are coded before being communicated. Numerical simulations on a large linear regression task confirm the benefits of the proposed distributed computation scheme with partial recovery in terms of the trade-off between the computation accuracy and latency.
翻译:代码化计算技术可以防止分布式计算中工人的分流,但是,大多数现有计算方法要求严格提供分流行为,忽视分层工人的计算方法;此外,这些方法的设计一般是为了准确回收预期的计算结果,而在许多机器学习和迭代优化算法中,已知更快的近似解决办法可以改善总体趋同时间。在本文中,我们首先采用了一个新的编码矩阵-矢量倍增方案,称为代码化计算,部分回收(CCPR),从编码和未编码的计算方法的优点中获益,并允许在计算准确性和速度之间进行权衡,从而减少计算时间和分解的复杂程度。然后,我们扩大这一方法,通过提出部分回收的编码通信方法来分配更一般的计算任务,在进行部分回收之前对工人所计算的子任务的结果进行编码。关于大型线性回归任务的数值模拟证实了拟议的分配计算方法的效益,在计算精确度与纬度之间的交易方面部分回收。