There are several factorizations of multi-dimensional tensors into lower-dimensional components, known as `tensor networks'. We consider the popular `tensor-train' (TT) format and ask: How efficiently can we compute a low-rank approximation from a full tensor on current multi-core CPUs? Compared to sparse and dense linear algebra, kernel libraries for multi-linear algebra are rare and typically not as well optimized. Linear algebra libraries like BLAS and LAPACK may provide the required operations in principle, but often at the cost of additional data movements for rearranging memory layouts. Furthermore, these libraries are typically optimized for the compute-bound case (e.g.\ square matrix operations), whereas low-rank tensor decompositions lead to memory-bandwidth-limited operations. We propose a `tensor-train singular value decomposition' (TT-SVD) algorithm based on two building blocks: a `Q-less tall-skinny QR' factorization, and a fused tall-skinny matrix-matrix multiplication and reshape operation. We analyze the performance of the resulting TT-SVD algorithm using the Roofline performance model. In addition, we present performance results for different algorithmic variants on shared-memory as well as distributed-memory architectures. Our experiments show that commonly used TT-SVD implementations suffer severe performance penalties. We conclude that a dedicated library for tensor factorization kernels would benefit the community: Computing a low-rank approximation can be as cheap as reading the data twice from main memory. As a consequence, an implementation that achieves realistic performance will raise the limit at which one has to resort to randomized methods that only process part of the data.
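For illustration, the following is a minimal sketch of the classical TT-SVD via successive truncated SVDs, written with plain NumPy. It only shows the decomposition itself; it does not use the Q-less tall-skinny QR or fused multiply-and-reshape building blocks proposed here, and the function name and rank-truncation strategy are assumptions for this example.

```python
import numpy as np

def tt_svd(x, max_rank):
    """Sketch: decompose a dense tensor x into TT cores with ranks <= max_rank."""
    dims = x.shape
    cores = []
    r_prev = 1
    mat = x.reshape(r_prev * dims[0], -1)      # unfold along the first mode
    for k in range(len(dims) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))              # simple rank truncation
        cores.append(u[:, :r].reshape(r_prev, dims[k], r))
        # carry the remainder to the next mode and refold it
        mat = (s[:r, None] * vt[:r]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(mat.reshape(r_prev, dims[-1], 1))
    return cores
```

In this reference formulation each step performs a truncated SVD of a tall-skinny unfolding of the remaining data, which is exactly the memory-bandwidth-limited pattern the paper targets with its optimized building blocks.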