Generalized Matrix Chains (GMCs) are products of matrices where each matrix carries features (e.g., general, symmetric, triangular, positive-definite) and is optionally transposed and/or inverted. GMCs are commonly evaluated via sequences of calls to BLAS and LAPACK kernels. When matrix sizes are known, one can craft a sequence of kernel calls to evaluate a GMC that minimizes some cost, e.g., the number of floating-point operations (FLOPs). Even in these circumstances, high-level languages and libraries, upon which users usually rely, typically perform a suboptimal mapping of the input GMC onto a sequence of kernels. In this work, we go one step beyond and consider matrix sizes to be symbolic (unknown); this changes the nature of the problem since no single sequence of kernel calls is optimal for all possible combinations of matrix sizes. We design and evaluate a code generator for GMCs with symbolic sizes that relies on multi-versioning. At compile-time, when the GMC is known but the sizes are not, code is generated for a few carefully selected sequences of kernel calls. At run-time, when sizes become known, the best generated variant for the matrix sizes at hand is selected and executed. The code generator uses new theoretical results that guarantee that the cost is within a constant factor from optimal for all matrix sizes and an empirical tuning component that further tightens the gap to optimality in practice. In experiments, we found that the increase above optimal in both FLOPs and execution time of the generated code was less than 15\% for 95\% of the tested chains.
翻译:广义矩阵链(GMCs)是指矩阵的乘积,其中每个矩阵具有特定特征(如一般性、对称性、三角性、正定性),并可选择进行转置和/或求逆。GMCs通常通过调用BLAS和LAPACK内核序列进行计算。当矩阵尺寸已知时,可以设计一个内核调用序列来评估GMC,以最小化某些成本,例如浮点运算次数(FLOPs)。即使在这种情况下,用户通常依赖的高级语言和库通常会将输入的GMC映射到一个次优的内核调用序列上。在本研究中,我们进一步考虑矩阵尺寸为符号化(未知)的情况;这改变了问题的性质,因为不存在单一的内核调用序列对所有可能的矩阵尺寸组合都是最优的。我们设计并评估了一个基于多版本化的符号化尺寸GMCs代码生成器。在编译时,当GMC已知但尺寸未知时,会为几个精心选择的内核调用序列生成代码。在运行时,当尺寸已知时,会根据当前矩阵尺寸选择并执行生成的最佳变体。该代码生成器利用了新的理论结果,确保所有矩阵尺寸下的成本均在最优解的常数因子范围内,并通过经验调优组件进一步缩小了与最优解的实际差距。在实验中,我们发现生成代码的FLOPs和执行时间相对于最优解的增量在95%的测试链中均低于15%。