In this paper, we introduce the Variable Coded Distributed Batch Matrix Multiplication (VCDBMM) problem which tasks a distributed system to perform batch matrix multiplication where matrices are not necessarily distinct among batch jobs. Most coded matrix-matrix computation work has broadly focused in two directions: matrix partitioning for computing a single computation task and batch processing of multiple distinct computation tasks. While these works provide codes with good straggler resilience and fast decoding for their problem spaces, these codes would not be able to take advantage of the natural redundancy of re-using matrices across batch jobs. Inspired by Cross-Subspace Alignment codes, we develop Flexible Cross-Subspace Alignments (FCSA) codes that are flexible enough to utilize this redundancy. We provide a full characterization of FCSA codes which allow for a wide variety of system complexities including good straggler resilience and fast decoding. We theoretically demonstrate that, under certain practical conditions, FCSA codes are within a factor of two of the optimal solution when it comes to straggler resilience; our simulations demonstrate that our codes achieve even better optimality gaps in practice.
翻译:在本文中,我们引入了可变代码分布批量矩阵乘数问题, 要求一个分布式系统在批量工作不一定不同的情况下进行批量矩阵乘数。 大多数编码矩阵矩阵计算工作大致集中于两个方向: 用于计算单项计算任务的矩阵分割和多种不同计算任务的批量处理。 虽然这些工作为代码提供了具有良好的排散弹性和问题空间快速解码功能的代码,但这些代码将无法利用在批量工作中重新使用矩阵的自然冗余。 在跨空间协调代码的启发下, 我们开发了足够灵活的灵活跨空间比对代码( FCSA ), 以便使用这种冗余。 我们为 FCSA 代码提供了全面的特征描述, 允许使用多种系统复杂性, 包括良好的strggler弹性和快速解码。 我们理论上证明, 在某些实际条件下, FCSA 代码在排散性适应能力方面属于两种最佳解决方案的一个要素。 我们的模拟表明, 我们的代码在实际操作中达到了更好的最佳差距。