Load Plus Communication Balancing of Contiguous Sparse Matrix Partitions

From arXiv, 19 pages; added experimental results, added a lazy near-linear bisection algorithm, clarified asymptotic guarantees, simplified the presentation of the parametric search algorithms, revised and reformatted to fit page limits. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

We partition to parallelize multiplication of one or more dense vectors by a sparse matrix (SpMV or SpMM). We consider contiguous partitions, where the rows (or columns) of a sparse matrix with $N$ nonzeros are split into $K$ parts without reordering. We propose exact and approximate contiguous partitioners that minimize the maximum runtime of any processor under a diverse family of cost models that combine work and hypergraph communication terms in symmetric or asymmetric settings. This differs from traditional partitioning models, which minimize total communication, and from traditional load balancing models, which only balance work. One can view our algorithms as optimally rounding one-dimensional embeddings of direct $K$-way noncontiguous partitioning problems. Our algorithms use linear space. Our exact algorithm runs in linear time when $K^2$ is $O(N^C)$ for $C < 1$. Our $(1 + \epsilon)$-approximate algorithm runs in linear time when $K\log(c_{\text{high}}/(c_{\text{low}}\epsilon))$ is $O(N^C)$ for $C < 1$, where $c_{\text{high}}$ and $c_{\text{low}}$ are upper and lower bounds on the optimal cost. We also propose a simpler $(1 + \epsilon)$-approximate algorithm whose runtime is within a factor of $\log(c_{\text{high}}/(c_{\text{low}}\epsilon))$ of linear time, but which is faster in practice. We empirically demonstrate that all of our algorithms efficiently produce high-quality contiguous partitions.
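To make the bisection idea concrete, the following is a minimal sketch under simplifying assumptions, not the algorithm from the paper. It assumes a load-only cost model (the cost of a part is the sum of per-row work, with no communication term) and pairs a greedy feasibility probe with bisection on the cost bound. The names `greedy_probe` and `approx_partition`, and the choice of initial bounds for $c_{\text{low}}$ and $c_{\text{high}}$, are hypothetical and chosen for illustration only.

```python
from typing import List, Optional, Tuple


def greedy_probe(row_work: List[float], k: int, budget: float) -> Optional[List[int]]:
    """Try to split rows into at most k contiguous parts with work <= budget each.

    Returns k + 1 split points on success, or None if the budget is infeasible.
    """
    splits = [0]
    current = 0.0
    for i, w in enumerate(row_work):
        if w > budget:
            return None  # a single row already exceeds the budget
        if current + w > budget:
            splits.append(i)  # close the current part just before row i
            current = 0.0
            if len(splits) > k:
                return None  # more than k parts would be needed
        current += w
    # Pad with empty trailing parts so exactly k parts are returned.
    splits.extend([len(row_work)] * (k + 1 - len(splits)))
    return splits


def approx_partition(row_work: List[float], k: int, eps: float) -> Tuple[List[int], float]:
    """Bisect on the cost bound until c_high <= (1 + eps) * c_low.

    Assumes row_work is non-empty and non-negative (load-only cost model).
    """
    total = sum(row_work)
    c_low = max(max(row_work), total / k)  # no partition can cost less than this
    c_high = total                         # one part holding every row is feasible
    best = greedy_probe(row_work, k, c_high)
    while c_high > (1 + eps) * c_low:
        mid = (c_low + c_high) / 2
        splits = greedy_probe(row_work, k, mid)
        if splits is None:
            c_low = mid                    # mid is infeasible, so the optimum is larger
        else:
            c_high, best = mid, splits     # mid is feasible, so tighten the upper bound
    return best, c_high


if __name__ == "__main__":
    work = [3, 1, 4, 1, 5, 9, 2, 6]  # e.g. nonzeros per row
    splits, bound = approx_partition(work, k=3, eps=0.01)
    print(splits, bound)
```

Each probe is a single linear pass, and the loop halves the gap between the bounds on every iteration, so the number of probes grows roughly like $\log(c_{\text{high}}/(c_{\text{low}}\epsilon))$. Handling communication terms would require a different cost oracle than the running sum used here.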
