Expressions that involve matrices and vectors, known as linear algebra expressions, are commonly evaluated through a sequence of invocations of highly optimised kernels provided by libraries such as BLAS and LAPACK. A sequence of kernels represents an algorithm, and in general, because of associativity, algebraic identities, and the availability of multiple kernels, one expression can be evaluated via many different algorithms. These algorithms are all mathematically equivalent (i.e., in exact arithmetic, they all compute the same result), but often differ noticeably in execution time. When faced with such a choice, high-level languages, libraries, and tools such as Julia, Armadillo, and Linnea select the algorithm that minimises the FLOP count. In this paper, we test the validity of the FLOP count as a discriminant for dense linear algebra algorithms by analysing "anomalies": problem instances for which the fastest algorithm does not perform the fewest FLOPs. To do so, we focused on relatively simple expressions and analysed when and why anomalies occurred. We found that anomalies exist and tend to cluster into large contiguous regions. For one expression anomalies were rare, whereas for the other they were abundant. We conclude that the FLOP count is not a sufficiently dependable discriminant, even when building algorithms from highly optimised kernels. Moreover, most of the anomalies remained as such even after filtering out inter-kernel cache effects. We conjecture that combining FLOP counts with kernel performance models will significantly improve our ability to choose optimal algorithms.