We propose an efficient, distributed, out-of-memory implementation of the truncated singular value decomposition (t-SVD) for heterogeneous (CPU+GPU) high performance computing (HPC) systems. Various implementations of the SVD have been proposed, but most only estimate the singular values, since estimating the singular vectors as well can significantly increase the time and memory complexity of the algorithm. In this work, we propose an implementation of the SVD based on the power method, which estimates both the truncated singular values and the corresponding singular vectors. The memory bottleneck of the power method is typically the computation of the Gram matrix $\mat{A}^T\mat{A}$, which is significant when $\mat{A}$ is large and dense, or very large and sparse. The proposed implementation is optimized for out-of-memory problems, where the memory required to factorize a given matrix exceeds the available GPU memory. We reduce the memory complexity of forming $\mat{A}^T\mat{A}$ with a batching strategy in which the intermediate factors are computed block by block. We also hide the I/O latency of the host-to-device (H2D) and device-to-host (D2H) batch copies by overlapping each batch copy with compute using CUDA streams. Furthermore, we use optimized \textit{NCCL}-based communicators to reduce the latency of collective communications, both intra-node and inter-node. In addition, sparse and dense matrix multiplications are significantly accelerated on GPU cores (or tensor cores when available), yielding an implementation with good scaling. We demonstrate the scalability of our distributed out-of-memory SVD algorithm by successfully decomposing a dense matrix of size 1TB and a sparse matrix of size 128PB with $10^{-6}$ sparsity.
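A minimal single-GPU sketch of the batched Gram-matrix power iteration described above, written with CuPy (this is not the authors' distributed implementation): the matrix stays in host memory, row blocks are streamed to the device, and the H2D copy of the next block overlaps with the GEMMs on the current one via separate CUDA streams. The function names, batch size, and single-GPU scope are illustrative assumptions, and the NCCL-based multi-node reduction is omitted.

\begin{verbatim}
import numpy as np
import cupy as cp

def gram_times_v(A_host, V, batch_rows=8192):
    """Return (A^T A) @ V without forming A^T A or holding A on the GPU.

    A_host : (m, n) NumPy array kept in host (CPU) memory.
    V      : (n, k) CuPy array of current right-singular-vector estimates.
    Row batches A_b are streamed to the device; the H2D copy of the next
    batch overlaps with the GEMMs on the current one. For the copy to
    truly overlap, A_host should live in pinned (page-locked) memory.
    """
    m, n = A_host.shape
    G = cp.zeros((n, V.shape[1]), dtype=V.dtype)
    copy_stream = cp.cuda.Stream(non_blocking=True)
    compute_stream = cp.cuda.Stream(non_blocking=True)

    with copy_stream:
        nxt = cp.asarray(A_host[:batch_rows])      # prefetch first batch
    for start in range(0, m, batch_rows):
        copy_stream.synchronize()                  # current batch is on device
        cur, end = nxt, start + nxt.shape[0]
        if end < m:                                # start copying next batch ...
            with copy_stream:
                nxt = cp.asarray(A_host[end:end + batch_rows])
        with compute_stream:                       # ... while computing on this one
            G += cur.T @ (cur @ V)                 # A_b^T (A_b V): O(nk) device memory
        compute_stream.synchronize()               # keep cur alive until GEMMs finish
    return G

def tsvd_power(A_host, k, iters=50, seed=0):
    """Truncated SVD by power (subspace) iteration on the Gram matrix."""
    n = A_host.shape[1]
    rng = cp.random.default_rng(seed)
    V = cp.linalg.qr(rng.standard_normal((n, k)))[0]  # random orthonormal start
    for _ in range(iters):
        V, R = cp.linalg.qr(gram_times_v(A_host, V))  # V <- orth((A^T A) V)
    sigma = cp.sqrt(cp.abs(cp.diag(R)))               # diag(R) ~ eigenvalues of A^T A
    return V, sigma
\end{verbatim}

With this batching, peak device memory is roughly one row block plus the $n \times k$ factors, so the factorizable matrix size is bounded by host (or disk) capacity rather than GPU memory, which is the effect the abstract describes.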