Linear algebra expressions, which play a central role in countless scientific computations, are often computed via a sequence of calls to existing libraries of building blocks (such as those provided by BLAS and LAPACK). Each sequence identifies a computing strategy, i.e., an algorithm, and for one linear algebra expression many alternative algorithms typically exist. Although mathematically equivalent, these algorithms can exhibit significant differences in performance. Several high-level languages and tools for matrix computations, such as Julia, Armadillo, and Linnea, make algorithmic choices by minimizing the number of Floating Point Operations (FLOPs). However, several algorithms may share the same (or a nearly identical) number of FLOPs; in many cases, these algorithms exhibit execution times that are statistically equivalent, and any one of them could be selected as the best algorithm. However, it is not uncommon to find cases where the execution times differ significantly from one another (despite the FLOP counts being almost the same). It is also possible that the algorithm that minimizes FLOPs is not the one that minimizes execution time. In this work, we develop a methodology to test the reliability of FLOPs as a discriminant for linear algebra algorithms. Given a set of algorithms (for an instance of a linear algebra expression) as input, the methodology ranks them into performance classes; i.e., multiple algorithms are allowed to share the same rank. To this end, we measure the algorithms iteratively until the changes in the ranks converge to a value close to zero. FLOPs are a valid discriminant for an instance if all the algorithms with minimum FLOPs are assigned the best rank; otherwise, the instance is regarded as an anomaly, which can then be used to investigate the root cause of the performance differences.
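To make the validity criterion concrete, the following is a minimal sketch (not the paper's implementation) of how one might group measured algorithms into performance classes and then check whether FLOPs are a valid discriminant. The algorithm names, timing samples, FLOP counts, and the use of a Welch t-test to decide statistical equivalence are all illustrative assumptions.

```python
# Minimal sketch: rank algorithms into performance classes from timing samples,
# then test whether every minimum-FLOP algorithm falls in the best class.
# This is an illustration, not the methodology described in the paper.
import statistics
from scipy import stats

def rank_algorithms(measurements, alpha=0.05):
    """Assign each algorithm a rank (performance class); algorithms whose
    execution times are statistically equivalent share the same rank.
    Simplification: each algorithm is only compared against the current
    class representative."""
    ordered = sorted(measurements, key=lambda a: statistics.median(measurements[a]))
    ranks = {}
    current_rank = 0
    representative = ordered[0]
    ranks[representative] = current_rank
    for alg in ordered[1:]:
        # Welch t-test: if the times differ significantly from the current
        # class representative, open a new (worse) performance class.
        _, p = stats.ttest_ind(measurements[representative], measurements[alg],
                               equal_var=False)
        if p < alpha:
            current_rank += 1
            representative = alg
        ranks[alg] = current_rank
    return ranks

def flops_are_valid_discriminant(flops, ranks):
    """FLOPs are a valid discriminant if all minimum-FLOP algorithms
    are assigned the best (lowest) rank."""
    min_flops = min(flops.values())
    best_rank = min(ranks.values())
    return all(ranks[a] == best_rank for a, f in flops.items() if f == min_flops)

# Toy usage with made-up timing samples (seconds) and FLOP counts.
measurements = {
    "alg_a": [1.02, 1.01, 1.03],   # minimum-FLOP algorithm
    "alg_b": [1.02, 1.04, 1.00],   # same FLOPs, statistically equivalent time
    "alg_c": [1.35, 1.36, 1.34],   # more FLOPs, significantly slower
}
flops = {"alg_a": 2.0e9, "alg_b": 2.0e9, "alg_c": 2.6e9}

ranks = rank_algorithms(measurements)
print(ranks)                                   # e.g. {'alg_b': 0, 'alg_a': 0, 'alg_c': 1}
print(flops_are_valid_discriminant(flops, ranks))  # True: no anomaly in this toy case
```

In this toy instance the minimum-FLOP algorithms all land in the best class, so FLOPs would be judged a valid discriminant; an anomaly would be a case where one of them is pushed into a worse class.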