General Matrix Multiplication (GEMM) is widely used in scientific simulation and artificial intelligence. Traditional libraries achieve high performance on large, regular-shaped GEMMs, but they often perform poorly on irregular-shaped GEMMs, which are common in new algorithms and applications of high-performance computing (HPC). Because of energy-efficiency constraints, low-power multi-core digital signal processors (DSPs) have become an alternative architecture in HPC systems. This work proposes ftIMM, an efficient implementation of three types of irregular-shaped GEMMs that targets the multi-core DSPs in FT-m7032, a prototype CPU-DSP heterogeneous processor for HPC. ftIMM supports automatic generation of assembly micro-kernels, two parallelization strategies, and auto-tuning of block sizes and parallelization strategies. Experiments show that, on irregular-shaped GEMMs, ftIMM outperforms traditional GEMM implementations on the multi-core DSPs in FT-m7032 by up to 7.2x. ftIMM on the multi-core DSPs also far outperforms the open-source library on the multi-core CPUs in FT-m7032, delivering up to 3.1x higher efficiency.
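To make the idea of auto-tuning block sizes concrete, the following is a minimal sketch in plain Python, not the ftIMM implementation. It assumes a simple three-level loop tiling and an empirical tuner that times each candidate; the names `blocked_gemm`, `autotune`, and the candidate block sizes are hypothetical, and the tall-and-skinny shape is chosen only as one example of an irregular shape.

```python
import time


def blocked_gemm(A, B, m, n, k, mb, nb, kb):
    """Compute C = A * B with loops tiled by block sizes (mb, nb, kb).

    A is m x k and B is k x n, both stored as lists of rows.
    This is an illustrative tiling, not the ftIMM micro-kernel.
    """
    C = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, mb):
        for p0 in range(0, k, kb):
            for j0 in range(0, n, nb):
                # Multiply one (mb x kb) tile of A by one (kb x nb) tile of B.
                for i in range(i0, min(i0 + mb, m)):
                    row_a, row_c = A[i], C[i]
                    for p in range(p0, min(p0 + kb, k)):
                        a, row_b = row_a[p], B[p]
                        for j in range(j0, min(j0 + nb, n)):
                            row_c[j] += a * row_b[j]
    return C


def autotune(A, B, m, n, k, candidates):
    """Return the candidate (mb, nb, kb) with the lowest measured run time."""
    best_time, best_blocks = None, None
    for mb, nb, kb in candidates:
        start = time.perf_counter()
        blocked_gemm(A, B, m, n, k, mb, nb, kb)
        elapsed = time.perf_counter() - start
        if best_time is None or elapsed < best_time:
            best_time, best_blocks = elapsed, (mb, nb, kb)
    return best_blocks


if __name__ == "__main__":
    # Tall-and-skinny example: m is much larger than n and k (assumed shape).
    m, n, k = 256, 4, 8
    A = [[float(i + p) for p in range(k)] for i in range(m)]
    B = [[float(p - j) for j in range(n)] for p in range(k)]
    candidates = [(16, 4, 8), (64, 2, 4), (256, 4, 8)]  # hypothetical choices
    print("selected block sizes:", autotune(A, B, m, n, k, candidates))
```

A real tuner for a DSP would also account for on-chip memory capacity and the parallelization strategy; this sketch only varies the tile sizes and picks the fastest by measurement.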