In recent years, general matrix-matrix multiplication with non-regular-shaped input matrices has been widely used in applications such as deep learning and has drawn increasing attention. However, conventional implementations are not well suited to non-regular-shaped matrix-matrix multiplication, and few works focus on optimizing tall-and-skinny matrix-matrix multiplication on CPUs. This paper proposes an auto-tuning framework, AutoTSMM, for building high-performance tall-and-skinny matrix-matrix multiplication. AutoTSMM selects the optimal inner kernels in the install-time stage and generates an execution plan for the pre-pack tall-and-skinny matrix-matrix multiplication in the runtime stage. Experiments demonstrate that AutoTSMM achieves performance competitive with state-of-the-art tall-and-skinny matrix-matrix multiplication implementations and outperforms all conventional matrix-matrix multiplication implementations.