We present two new algorithms for Householder QR factorization of Block Low-Rank (BLR) matrices: one that performs block-column-wise QR, and another that is based on tiled QR. We show how the block-column-wise algorithm exploits BLR structure to achieve arithmetic complexity of $\mathcal{O}(mn)$, while the tiled BLR-QR exhibits $\mathcal{O}(mn^{1.5})$ complexity. However, the tiled BLR-QR has finer task granularity that allows parallel task-based execution on shared memory systems. We compare the block-column-wise BLR-QR using fork-join parallelism with tiled BLR-QR using task-based parallelism. We also compare these two implementations of Householder BLR-QR with a block-column-wise Modified Gram-Schmidt (MGS) BLR-QR using fork-join parallelism, and a state-of-the-art vendor-optimized dense Householder QR in Intel MKL. For a matrix of size 131k $\times$ 65k, all BLR methods are more than an order of magnitude faster than the dense QR in MKL. Our methods are also robust to ill-conditioning and produce better orthogonal factors than the existing MGS-based method. On a CPU with 64 cores, our parallel tiled Householder and block-column-wise Householder algorithms show a speedup of 50 and 37 times, respectively.
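The claim about orthogonality under ill-conditioning can be illustrated on a small dense matrix. The sketch below is not the BLR algorithms described above. It assumes plain dense arithmetic and compares a textbook Modified Gram-Schmidt QR with the Householder-based QR that NumPy obtains from LAPACK. The helper names `mgs_qr`, `orthogonality_error`, and `ill_conditioned`, the matrix size, and the condition number are all hypothetical choices made for illustration.

```python
import numpy as np

def mgs_qr(A):
    """Dense Modified Gram-Schmidt QR (illustrative, not block low-rank)."""
    Q = np.array(A, dtype=float)
    n = Q.shape[1]
    R = np.zeros((n, n))
    for j in range(n):
        # Normalize column j.
        R[j, j] = np.linalg.norm(Q[:, j])
        Q[:, j] /= R[j, j]
        # Remove the q_j component from all remaining columns.
        R[j, j + 1:] = Q[:, j] @ Q[:, j + 1:]
        Q[:, j + 1:] -= np.outer(Q[:, j], R[j, j + 1:])
    return Q, R

def orthogonality_error(Q):
    """Frobenius norm of Q^T Q - I; zero for exactly orthonormal columns."""
    return np.linalg.norm(Q.T @ Q - np.eye(Q.shape[1]))

def ill_conditioned(m, n, cond, seed=0):
    """Random m x n matrix with singular values spread from 1 down to 1/cond."""
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((m, n)))
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))
    s = np.logspace(0, -np.log10(cond), n)
    return (U * s) @ V.T

A = ill_conditioned(200, 50, cond=1e10)  # assumed test problem
Q_mgs, R_mgs = mgs_qr(A)
Q_hh, R_hh = np.linalg.qr(A)  # Householder-based QR via LAPACK

err_mgs = orthogonality_error(Q_mgs)
err_hh = orthogonality_error(Q_hh)
print(f"MGS         ||Q^T Q - I||_F = {err_mgs:.2e}")
print(f"Householder ||Q^T Q - I||_F = {err_hh:.2e}")
```

Both factorizations reproduce `A` accurately, but the orthogonality error of MGS grows with the condition number, while the Householder factor stays orthogonal to roughly machine precision. This dense behavior is what the comparison between Householder-based and MGS-based BLR-QR relies on.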