We present an optimized single-precision implementation of the Sparse Approximate Matrix Multiply (\SpAMM{}) [M. Challacombe and N. Bock, arXiv {\bf 1011.3534} (2010)], a fast algorithm for multiplying matrices with decay that achieves $\mathcal{O} (n \log n)$ computational complexity with respect to the matrix dimension $n$. For dense quantum chemical matrices, we find that \SpAMM{} with a tolerance below $2 \times 10^{-8}$ achieves a lower max norm of the error than single-precision {\tt SGEMM}, and that it outperforms {\tt SGEMM} with a cross-over already at small matrix sizes ($n \sim 1000$). Our optimized version is significantly faster than naive implementations of \SpAMM{} built on Intel's Math Kernel Library ({\tt MKL}) or AMD's Core Math Library ({\tt ACML}). Detailed performance comparisons are made for quantum chemical matrices with differently structured sub-blocks. Finally, we discuss the potential of improved hardware prefetch to yield $2$--$3\times$ speedups.