A novel fault-tolerant computation technique based on array Belief Propagation (BP)-decodable XOR (BP-XOR) codes is proposed for distributed matrix-matrix multiplication. The proposed scheme is shown to be configurable and well suited to modern hierarchical compute architectures such as Graphics Processing Units (GPUs) equipped with multiple nodes, each of which has many small independent processing units with increased core-to-core communication. The scheme is shown to outperform several well-known earlier strategies in terms of total end-to-end execution time in the presence of slow nodes, called stragglers. This performance advantage comes from the careful design of the array codes, which distributes the encoding operation over the cluster (slave) nodes at the expense of increased master-slave communication. The resulting trade-off between end-to-end latency and total communication cost is characterized precisely. In addition, to address an identified problem of scaling stragglers, an asymptotic version of array BP-XOR codes based on projection geometry is proposed at the expense of some computation overhead. A thorough latency analysis is conducted for all schemes to demonstrate that, from an end-to-end delay perspective, the proposed scheme achieves order-optimal computation in both the sublinear and the linear regimes in the size of the computed product.
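To make the term "BP-decodable" concrete, the following is a minimal sketch of generic peeling (BP) decoding for an XOR code under erasures. It is not the array BP-XOR construction proposed here: the function names `encode` and `bp_decode`, the `subsets` layout, and the use of small integers in place of blocks of the matrix product are all assumptions made for illustration. Coded symbols from stragglers are modeled as simply missing.

```python
from functools import reduce
from operator import xor

def encode(data, subsets):
    """Hypothetical encoder: each coded symbol is the XOR of the
    data symbols indexed by one (non-empty) subset."""
    return [reduce(xor, (data[i] for i in s)) for s in subsets]

def bp_decode(k, subsets, received):
    """Peeling decoder. `received` maps coded-symbol index -> value for the
    symbols that arrived; symbols held by stragglers are absent.
    Returns {data index: value}, or None if peeling stalls."""
    # Per coded symbol: set of still-unknown data indices and residual value.
    pending = {j: (set(subsets[j]), v) for j, v in received.items()}
    decoded = {}
    progress = True
    while progress and len(decoded) < k:
        progress = False
        for j, (nbrs, val) in list(pending.items()):
            # XOR out every data symbol that is already known.
            for i in [i for i in nbrs if i in decoded]:
                nbrs.discard(i)
                val ^= decoded[i]
            if len(nbrs) == 1:
                # Degree-one symbol: it directly reveals one data symbol.
                decoded[nbrs.pop()] = val
                del pending[j]
                progress = True
            elif not nbrs:
                # Nothing unknown left; the symbol is redundant.
                del pending[j]
            else:
                pending[j] = (nbrs, val)
    return decoded if len(decoded) == k else None

# Toy example: 3 data symbols, 4 coded symbols (one redundant).
data = [5, 9, 12]
subsets = [(0,), (0, 1), (1, 2), (0, 2)]
coded = encode(data, subsets)

# Coded symbol 3 is lost to a straggler; peeling still recovers everything.
recovered = bp_decode(3, subsets, {0: coded[0], 1: coded[1], 2: coded[2]})

# Coded symbol 0 is lost instead; no degree-one symbol remains, so peeling stalls.
stalled = bp_decode(3, subsets, {1: coded[1], 2: coded[2], 3: coded[3]})
```

The second call shows why the code design matters: which subsets of surviving coded symbols allow peeling to finish depends entirely on how the XOR combinations are chosen, and the decoder itself only ever performs XORs.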