The multiplication of a matrix by its transpose, $A^T A$, appears as an intermediate operation in the solution of a wide set of problems. In this paper, we propose a new cache-oblivious algorithm (ATA) for computing this product, based upon the classical Strassen algorithm as a sub-routine. In particular, we decrease the computational cost to $\frac{2}{3}$ the time required by Strassen's algorithm, amounting to $\frac{14}{3}n^{\log_2 7}$ floating point operations. ATA works for generic rectangular matrices, and exploits the peculiar symmetry of the resulting product matrix for saving memory. In addition, we provide an extensive implementation study of ATA in a shared memory system, and extend its applicability to a distributed environment. To support our findings, we compare our algorithm with state-of-the-art solutions specialized in the computation of $A^T A$. Our experiments highlight good scalability with respect to both the matrix size and the number of involved processes, as well as favorable performance for both the parallel paradigms and the sequential implementation, when compared with other methods in the literature.
翻译:变换后的矩阵乘法( $A_T A$) 似乎是解决一系列广泛问题的一种中间操作。 在本文中, 我们基于古典Strassen 算法作为子常规, 提议一种新的缓存隐蔽算法(ATA) 来计算这一产品。 特别是, 我们将计算成本降低到 $frac{ 2 ⁇ 3} 美元, 也就是Strassen 算法所需要的时间, 即$frac{14}3}n ⁇ log_ 2 7} 浮动点操作所需的时间 。 ATA 工作为通用矩形矩阵, 并开发由此产生的产品矩阵的特殊对称以保存记忆。 此外, 我们还在共享记忆系统中对ATA进行广泛的执行研究, 并将其应用范围扩大到分布式环境。 为了支持我们的发现, 我们比较我们的算法与在计算 $A_T A$中专门使用的最新解决方案。 我们的实验显示, 在矩阵大小和所涉进程数量方面, 以及与其他文献的平行模式和连续执行方法相比, 在平行模式和连续执行中, 我们的绩效表现良好。