The randomized singular value decomposition (RSVD) is by now a well-established technique for efficiently computing an approximate singular value decomposition of a matrix. Building on the ideas that underpin the RSVD, the recently proposed algorithm "randUTV" computes a full factorization of a given matrix that provides low-rank approximations with near-optimal error. Because the bulk of randUTV is cast in terms of communication-efficient operations such as matrix-matrix multiplication and unpivoted QR factorizations, it is faster than competing rank-revealing factorization methods such as column-pivoted QR in most high-performance computing settings. In this article, optimized randUTV implementations are presented for both shared-memory and distributed-memory computing environments. For shared memory, randUTV is redesigned as an "algorithm-by-blocks" that, together with a runtime task scheduler, eliminates the bottlenecks caused by data synchronization points and thus accelerates the standard "blocked algorithm," which is based on a purely fork-join approach. The distributed-memory implementation is built on the ScaLAPACK library. The performance of our new codes compares favorably with that of competing factorizations available on both shared-memory and distributed-memory architectures.
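For readers unfamiliar with the basic RSVD on which randUTV builds, the following is a minimal NumPy sketch of the standard procedure (Gaussian sketching, optional power iterations, and a small dense SVD). The function name `rsvd` and the parameters `k`, `p`, and `q` are illustrative choices for this sketch and are not part of the implementations described in this article.

```python
import numpy as np

def rsvd(A, k, p=10, q=2):
    """Basic randomized SVD sketch (illustrative, not the article's code).

    A : (m, n) matrix to approximate
    k : target rank
    p : oversampling parameter
    q : number of power iterations
    Returns U, s, Vt with A approximately equal to U @ np.diag(s) @ Vt.
    """
    m, n = A.shape
    # Draw a Gaussian test matrix and sample the range of A.
    G = np.random.standard_normal((n, k + p))
    Y = A @ G
    # Optional power iterations sharpen the captured range
    # (a simplified variant without intermediate re-orthonormalization).
    for _ in range(q):
        Y = A @ (A.T @ Y)
    # Orthonormal basis for the sampled range via an unpivoted QR.
    Q, _ = np.linalg.qr(Y)
    # Project A onto the basis and take a small dense SVD.
    B = Q.T @ A
    Ub, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Ub
    return U[:, :k], s[:k], Vt[:k, :]
```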