Sparse compilers are a promising solution for optimizing sparse tensor algebra. In compiler implementations, reduction in sparse-dense hybrid algebra plays a key role in performance. Although GPUs provide various reduction semantics that can better utilize their parallel computing and memory bandwidth capacity, the central question is how to elevate these flexible reduction semantics into a sparse compilation theory that assumes serial execution. Specifically, we must tackle two main challenges: (1) a static synchronization granularity wastes parallelism, and (2) a static reduction strategy limits exploration of the optimization space. We propose Sgap, segment group and atomic parallelism, to solve these problems. Atomic parallelism captures the flexible reduction semantics in order to systematically analyze the optimization space of sparse-dense hybrid algebra on GPUs. It is a new optimization technique beyond those found in current compiler-based approaches and open-source runtime libraries. Segment group elevates the flexible reduction semantics to a suitable level of abstraction in sparse compilation theory. It adopts a changeable group size and a user-defined reduction strategy to solve challenges (1) and (2), respectively. Finally, we use GPU sparse matrix-matrix multiplication (SpMM) in the TACO compiler as a use case to demonstrate the effectiveness of segment group in elevating reduction semantics. We achieve up to 1.2x speedup over the original TACO SpMM kernels. We also apply new optimization techniques found through atomic parallelism to dgSPARSE, an open-source state-of-the-art SpMM library, and achieve 1.6x to 2.3x speedup on the algorithm tuned with atomic parallelism.
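To make the role of a changeable group size concrete, the following is a minimal sketch in plain Python, not the Sgap implementation and not TACO or dgSPARSE code. It serially simulates a nonzero-parallel SpMM in which each consecutive run of `group_size` nonzeros stands in for a group of cooperating GPU threads. The group reduces partial products that share a row, then issues one simulated atomic add per row segment. The function name `spmm_grouped`, the one-nonzero-per-thread mapping, and the atomic counter are all assumptions made for illustration.

```python
def spmm_grouped(indptr, indices, data, B, group_size):
    """Serially simulate C = A @ B for a CSR matrix A with grouped reduction.

    Hypothetical illustration: each chunk of `group_size` nonzeros plays the
    role of one thread group; every flush to C models one atomic add.
    Returns the dense result C and the number of simulated atomic adds.
    """
    n_rows = len(indptr) - 1
    n_cols = len(B[0])
    C = [[0.0] * n_cols for _ in range(n_rows)]

    # Expand CSR row pointers into one row id per nonzero.
    rows = [r for r in range(n_rows) for _ in range(indptr[r], indptr[r + 1])]

    atomics = 0
    for start in range(0, len(data), group_size):
        end = min(start + group_size, len(data))
        acc = [0.0] * n_cols
        cur = rows[start]
        for k in range(start, end):
            if rows[k] != cur:
                # Row changed inside the group: flush the finished segment.
                for j in range(n_cols):
                    C[cur][j] += acc[j]
                atomics += 1
                acc = [0.0] * n_cols
                cur = rows[k]
            # Accumulate this nonzero's contribution within the segment.
            for j in range(n_cols):
                acc[j] += data[k] * B[indices[k]][j]
        # Flush the last segment of the group.
        for j in range(n_cols):
            C[cur][j] += acc[j]
        atomics += 1
    return C, atomics


# A = [[1, 0, 2], [0, 0, 0], [3, 4, 5]] in CSR form, B is 3 x 2 dense.
indptr = [0, 2, 2, 5]
indices = [0, 2, 0, 1, 2]
data = [1.0, 2.0, 3.0, 4.0, 5.0]
B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

for g in (1, 2, 8):
    C, atomics = spmm_grouped(indptr, indices, data, B, g)
    print(g, atomics, C)
```

In this toy model the result is the same for every group size, while the number of simulated atomic adds changes with it. On real hardware a larger group also costs more synchronization within the group, which is why the best granularity plausibly depends on the input and is worth exposing as a tunable.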