Matrix decompositions are ubiquitous in machine learning, with applications in dimensionality reduction, data compression, and deep learning. Typical algorithms for computing these decompositions have polynomial complexity, which makes them costly in both computation and time. In this work, we reduce this burden by leveraging efficient operations that run in parallel on modern Graphics Processing Units (GPUs), the predominant computing architecture in, e.g., deep learning. More specifically, we reformulate the randomized decomposition problem so that fast matrix multiplication operations (BLAS-3) serve as its building blocks. We show that this formulation, combined with fast random number generators, fully exploits the parallel processing capabilities of GPUs. Our extensive evaluation confirms the superiority of this approach over competing methods, and we release the results of this research as part of the official CUDA implementation (https://docs.nvidia.com/cuda/cusolver/index.html).
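To make the BLAS-3 structure concrete, the following is a minimal NumPy sketch of the standard randomized SVD (a Gaussian range finder followed by a small deterministic SVD), not the paper's actual cuSOLVER implementation; the function name and the oversampling and power-iteration parameters (p, n_iter) are illustrative choices. The point of the sketch is that almost all of the work sits in a handful of large matrix products, which on a GPU map directly to parallel GEMM (BLAS-3) calls, while the test matrix Omega is produced by a fast random number generator.

```python
import numpy as np

def randomized_svd(A, k, p=10, n_iter=2, rng=None):
    """Rank-k randomized SVD sketch (illustrative, CPU/NumPy).

    A      : (m, n) input matrix
    k      : target rank
    p      : oversampling parameter (assumption: small constant, e.g. 10)
    n_iter : number of power iterations for slowly decaying spectra
    """
    rng = np.random.default_rng(rng)
    m, n = A.shape

    # Gaussian test matrix -- generated by a fast RNG in the GPU setting.
    Omega = rng.standard_normal((n, k + p))

    # Range sketch: one large matrix product (a BLAS-3 GEMM on a GPU).
    Y = A @ Omega

    # Each power iteration costs two more GEMMs and sharpens the sketch.
    for _ in range(n_iter):
        Y = A @ (A.T @ Y)

    # Orthonormal basis for the sketched range.
    Q, _ = np.linalg.qr(Y)

    # Project onto the small (k+p)-dimensional subspace: another GEMM.
    B = Q.T @ A

    # Deterministic SVD of the small matrix, then lift back with a GEMM.
    Uhat, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ Uhat
    return U[:, :k], s[:k], Vt[:k, :]

if __name__ == "__main__":
    # Usage on a synthetic low-rank matrix: the relative error should be
    # near machine precision because the true rank (40) is below k.
    rng = np.random.default_rng(0)
    A = rng.standard_normal((2000, 40)) @ rng.standard_normal((40, 1500))
    U, s, Vt = randomized_svd(A, k=50)
    print(np.linalg.norm(A - (U * s) @ Vt) / np.linalg.norm(A))
```

The design choice the abstract alludes to is visible here: apart from one thin QR and one small SVD, every step is a dense matrix-matrix product, which is exactly the operation GPUs execute most efficiently.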