Tile low rank (TLR) representations of dense matrices partition them into blocks of roughly uniform size, where each off-diagonal tile is compressed and stored as its own low rank factorization. They offer an attractive representation for many data-sparse dense operators that appear in practical applications, where substantial compression and a much smaller memory footprint can be achieved. TLR matrices are a compromise between the simplicity of a regular perfectly-strided data structure and the optimal complexity of the unbalanced trees of hierarchically low rank matrices, and they provide a convenient performance-tuning parameter through their tile size, which can be proportioned to the cache size of the level of the memory hierarchy where the tiles reside. There are currently no high-performance algorithms that can generate Cholesky and $LDL^T$ factorizations of TLR matrices, particularly on GPUs. The difficulties in achieving high performance when factoring TLR matrices come from the expensive compression operations that must be performed during the factorization process, and from the adaptive rank distribution of the tiles, which causes an irregular work pattern for the processing cores. In this work, we develop a dynamic batching operation and combine it with batched adaptive randomized approximations to achieve high performance on both GPUs and CPUs. Our implementation attains over 1.2 TFLOP/s in double precision on the V100 GPU, and is limited by the performance of batched GEMM operations. A covariance matrix of size $N = 131K$ arising in spatial statistics can be Cholesky-factored to an accuracy $\epsilon = 10^{-2}$ in just a few seconds. We believe the GEMM-centric nature of the proposed algorithm allows it to be readily ported to newer hardware, such as tensor cores, which are optimized for small GEMM operations.
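To make the representation concrete, the following is a minimal sketch of TLR compression: the matrix is partitioned into uniform tiles, diagonal tiles are kept dense, and each off-diagonal tile is stored as its own truncated low rank factorization. The kernel matrix, tile size, and SVD-based truncation below are illustrative assumptions, not the paper's implementation (which uses batched adaptive randomized approximations).

```python
# Sketch of a tile low rank (TLR) representation, assuming a smooth
# kernel-generated (data-sparse) matrix. Tile size and the SVD-based
# compression here are illustrative choices only.
import numpy as np

def compress_tile(tile, eps):
    """Truncated SVD: keep singular values above eps * sigma_max."""
    U, s, Vt = np.linalg.svd(tile, full_matrices=False)
    k = max(1, int(np.sum(s > eps * s[0])))
    return U[:, :k] * s[:k], Vt[:k, :]  # rank-k factors (U*Sigma, V^T)

def tlr_compress(A, ts, eps):
    """Dense diagonal tiles; off-diagonal tiles as low rank factor pairs."""
    nt = A.shape[0] // ts
    tiles = {}
    for i in range(nt):
        for j in range(nt):
            blk = A[i*ts:(i+1)*ts, j*ts:(j+1)*ts]
            tiles[(i, j)] = blk.copy() if i == j else compress_tile(blk, eps)
    return tiles

# Example: a smooth kernel matrix whose off-diagonal tiles have low rank.
n, ts, eps = 256, 64, 1e-6
x = np.linspace(0.0, 1.0, n)
A = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :]))
tiles = tlr_compress(A, ts, eps)

# Reconstruct and check accuracy; tally the compressed footprint.
B = np.zeros_like(A)
entries = 0
for (i, j), t in tiles.items():
    if i == j:
        B[i*ts:(i+1)*ts, j*ts:(j+1)*ts] = t
        entries += t.size
    else:
        U, Vt = t
        B[i*ts:(i+1)*ts, j*ts:(j+1)*ts] = U @ Vt
        entries += U.size + Vt.size

rel_err = np.linalg.norm(A - B) / np.linalg.norm(A)
```

In this sketch, the adaptive ranks (different `k` per tile) are exactly what produces the irregular per-tile work pattern that the dynamic batching in this work is designed to handle.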