Random projection can reduce the dimensionality of data while preserving its structure, and it is a fundamental tool in machine learning, signal processing, and information retrieval, fields that today deal with large amounts of data. RandNLA (Randomized Numerical Linear Algebra) leverages random projection to reduce the computational complexity of low-rank tensor decompositions and least-squares problems. Although the random projection itself is a simple matrix multiplication, its asymptotic computational complexity is typically higher than that of the other operations in a RandNLA algorithm. Therefore, various studies have proposed methods for reducing its computational complexity. We propose a fast mixed-precision random projection method for single-precision tensors using Tensor Cores on NVIDIA GPUs. We exploit the fact that the random matrix requires less precision, and we develop a highly optimized matrix multiplication between FP32 and FP16 matrices -- SHGEMM (Single- and Half-precision GEMM) -- on Tensor Cores, where the random matrix is stored in FP16. Our method computes Randomized SVD 1.28 times faster and Random projection high order SVD 1.75 times faster than baseline single-precision implementations while maintaining accuracy.
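To make the mixed-precision idea concrete, below is a minimal CUDA sketch of a product C = A * B on Tensor Cores via the WMMA API, where the FP32 data matrix A is cast tile-by-tile to FP16 on the fly and the random matrix B is already stored in FP16, with accumulation in FP32. This is an illustration under stated assumptions, not the paper's optimized SHGEMM kernel: the kernel name shgemm_sketch, the one-warp-per-block layout, the plain FP32-to-FP16 cast (the real SHGEMM presumably handles the precision loss of this cast more carefully), and the requirement that all dimensions are multiples of 16 are assumptions for brevity.

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

constexpr int TILE = 16;  // WMMA m = n = k = 16 for FP16 inputs, FP32 accumulation

// Illustrative mixed-precision GEMM: C (FP32, m x n) = A (FP32, m x k) * B (FP16, k x n).
// One warp per block; each block computes one 16x16 tile of C.
// Assumes row-major matrices and m, n, k all multiples of 16.
__global__ void shgemm_sketch(int m, int n, int k,
                              const float* A, int lda,
                              const half*  B, int ldb,
                              float*       C, int ldc) {
    const int tile_row = blockIdx.y * TILE;
    const int tile_col = blockIdx.x * TILE;

    // Staging buffer: the FP32 tile of A is cast to FP16 before it is fed
    // to the Tensor Cores; the FP16 random matrix B needs no conversion.
    __shared__ half a_f16[TILE * TILE];

    wmma::fragment<wmma::matrix_a, TILE, TILE, TILE, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, TILE, TILE, TILE, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, TILE, TILE, TILE, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int kk = 0; kk < k; kk += TILE) {
        // The 32 lanes of the warp convert the 256 FP32 elements of the
        // current A tile to FP16 in shared memory.
        for (int i = threadIdx.x; i < TILE * TILE; i += 32) {
            const int r = i / TILE, c = i % TILE;
            a_f16[i] = __float2half(A[(tile_row + r) * lda + (kk + c)]);
        }
        __syncwarp();
        wmma::load_matrix_sync(a_frag, a_f16, TILE);
        wmma::load_matrix_sync(b_frag, B + kk * ldb + tile_col, ldb);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // FP32 accumulate
        __syncwarp();  // a_f16 may be overwritten in the next iteration
    }
    wmma::store_matrix_sync(C + tile_row * ldc + tile_col, c_frag, ldc,
                            wmma::mem_row_major);
}

// Example launch (one warp per 16x16 output tile):
//   shgemm_sketch<<<dim3(n / 16, m / 16), 32>>>(m, n, k, dA, k, dB, n, dC, n);

Storing only the random matrix in FP16 halves its memory traffic and enables Tensor Core throughput while keeping the data matrix and the accumulator in FP32, which is why the precision of the projection result can be maintained.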