We develop a fused matrix multiplication kernel that unifies sampled dense-dense matrix multiplication (SDDMM) and sparse-dense matrix multiplication (SpMM) under a single operation called FusedMM. Through user-defined functions, FusedMM can capture almost all computational patterns needed by popular graph embedding and GNN approaches. FusedMM is an order of magnitude faster than the equivalent kernels in Deep Graph Library. Its superior performance comes from low-level vectorized kernels, a suitable load-balancing scheme, and efficient use of memory bandwidth. FusedMM tunes its performance using a code generator and performs equally well on Intel, AMD, and ARM processors. FusedMM speeds up an end-to-end graph embedding algorithm by up to 28x on different processors.