We present tritonBLAS, a fast, deterministic analytical model that uses architectural parameters, such as the cache hierarchy and the relative placement of code and data, to generate performant GPU GEMM kernels. tritonBLAS explicitly models the relationship between architectural topology, matrix shapes, and algorithmic blocking behavior to predict near-optimal configurations without runtime autotuning. Based on this model, we developed and implemented a lightweight GEMM framework entirely within Triton. We evaluate tritonBLAS across a diverse set of GEMM problem sizes on modern GPUs, where it achieves over 95% of the performance of autotuned solutions while reducing autotuning time to zero. This makes tritonBLAS a practical drop-in replacement for empirical tuning in production HPC and ML workloads.
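To make the idea concrete, here is a minimal sketch, not the actual tritonBLAS model, of how an analytical model can deterministically pick GEMM block sizes from architectural parameters alone, with no runtime autotuning. All constants (shared-memory capacity, candidate block sizes, fp16 operands) and the scoring heuristic are illustrative assumptions.

```python
# Hypothetical sketch of an analytical tile-size model: choose
# (BLOCK_M, BLOCK_N, BLOCK_K) by maximizing data reuse subject to a
# shared-memory capacity constraint. Parameter values are assumptions,
# not tritonBLAS internals.
from itertools import product

SHARED_MEM_BYTES = 64 * 1024   # assumed per-SM/CU shared-memory capacity
DTYPE_BYTES = 2                # assumed fp16 operands


def score(bm: int, bn: int, bk: int) -> float:
    """Arithmetic intensity of one tile: FLOPs per byte staged through
    shared memory. Higher means more reuse of loaded data."""
    flops = 2 * bm * bn * bk
    bytes_moved = (bm * bk + bk * bn) * DTYPE_BYTES
    return flops / bytes_moved


def pick_config(m: int, n: int, k: int):
    """Deterministically choose block sizes for an m x n x k GEMM:
    maximize reuse while both input tiles fit in shared memory
    (double-buffered), and never exceed the problem dimensions."""
    best, best_score = None, -1.0
    for bm, bn, bk in product([32, 64, 128, 256], repeat=3):
        if bm > m or bn > n or bk > k:
            continue
        # Capacity constraint: A-tile + B-tile, double-buffered.
        if 2 * (bm * bk + bk * bn) * DTYPE_BYTES > SHARED_MEM_BYTES:
            continue
        s = score(bm, bn, bk)
        if s > best_score:
            best, best_score = (bm, bn, bk), s
    return best


print(pick_config(4096, 4096, 4096))  # → (256, 256, 32)
```

Because the search is a closed-form evaluation over a small fixed candidate set, the prediction costs microseconds and is fully reproducible, which is the property that lets a model like this replace empirical autotuning sweeps.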