COSTA: 与程序重新标签的通信- 最佳打碎和转换 (COSTA: Communication-Optimal Shuffle and Transpose Algorithm with Process Relabeling)

from arxiv, To be published in the proceedings of the 36th International Conference on High Performance Computing, ISC High Performance 2021. The implementation of the algorithm is available at: https://github.com/eth-cscs/COSTA

Communication-avoiding algorithms for Linear Algebra have become increasingly popular, in particular for distributed memory architectures. In practice, these algorithms assume that the data is already distributed in a specific way, thus making data reshuffling a key to use them. For performance reasons, a straightforward all-to-all exchange must be avoided. Here, we show that process relabeling (i.e. permuting processes in the final layout) can be used to obtain communication optimality for data reshuffling, and that it can be efficiently found by solving a Linear Assignment Problem (Maximum Weight Bipartite Perfect Matching). Based on this, we have developed a Communication-Optimal Shuffle and Transpose Algorithm (COSTA): this highly-optimised algorithm implements $A=\alpha\cdot \operatorname{op}(B) + \beta \cdot A,\ \operatorname{op} \in \{\operatorname{transpose}, \operatorname{conjugate-transpose}, \operatorname{identity}\}$ on distributed systems, where $A, B$ are matrices with potentially different (distributed) layouts and $\alpha, \beta$ are scalars. COSTA can take advantage of the communication-optimal process relabeling even for heterogeneous network topologies, where latency and bandwidth differ among nodes. The implementation not only outperforms the best available ScaLAPACK redistribute and transpose routines multiple times, but is also able to deal with more general matrix layouts, in particular it is not limited to block-cyclic layouts. Finally, we use COSTA to integrate a communication-optimal matrix multiplication algorithm into the CP2K quantum chemistry simulation package. This way, we show that COSTA can be used to unlock the full potential of recent Linear Algebra algorithms in applications by facilitating interoperability between algorithms with a wide range of data layouts, in addition to bringing significant redistribution speedups.

翻译：Linear 代数的通信比对算法越来越受欢迎, 特别是分布式内存结构。实际上, 这些算法假定数据已经以特定的方式分布, 从而将数据重新配置为使用它们的关键。出于性能原因, 必须避免一个直截了当的全到一通交换。这里, 我们显示, 重新标签的过程( 即, 最后版式中的移动进程) 可以用来获取数据重整的通信最佳性, 并且可以通过解决线性任务问题( 最大 Out- 双向流速度完美匹配 ) 来有效找到数据。基于此, 我们开发了一个通信- Opatal Shall 和 Transportal 工具( COSTA): 这种高度优化的算法可以使用 $aqalpha\ cotorname{op} (B) + abta dadal- docket A,\ sirealatename{cloadal dismologyral dismoal) 也可以将一个最高级的和最高级的系统引入。