Fine-grained workload and resource balancing is the key to high performance for both regular and irregular computations on GPUs. In this dissertation, we conduct an extensive survey of existing load-balancing techniques and use it to build an abstraction that addresses the difficulty of scheduling computations on the GPU. We propose a fine-grained GPU load-balancing abstraction that decouples load balancing from work processing and aims to support both static and dynamic schedules, with a programmable interface for implementing new load-balancing schedules. Prior to our work, the only way to unleash the GPU's potential on irregular problems was to balance the workload through application-specific, tightly coupled load-balancing techniques. With our open-source load-balancing framework, we hope to improve programmers' productivity when developing irregular-parallel algorithms on the GPU, and to improve the overall performance of such applications by providing a quick path to experimenting with a variety of existing load-balancing techniques. Using our insights from load-balancing irregular workloads, we build Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method partitions an even share of the aggregate inner-loop iterations among physical processing elements. This provides near-perfect utilization of computing resources, regardless of how efficiently the output tiling for any given problem quantizes across the underlying processing elements. On GPU processors, our Stream-K parallelization of GEMM produces peak speedups of up to 14x and 6.7x, and an average performance response that is both higher and more consistent across 32K GEMM problem geometries, than state-of-the-art math libraries such as CUTLASS and cuBLAS.
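The contrast between tile-based and work-centric decomposition can be illustrated with a small host-side sketch. This is not the dissertation's implementation. The function names, the tile sizes, and the number of processing elements (`num_sms`) are assumptions chosen for illustration, and the sketch only computes which inner-loop iterations each processing element would receive. It does not model how partial sums are combined when one output tile is shared by several processing elements.

```python
from math import ceil


def tile_based_utilization(m, n, tile_m, tile_n, num_sms):
    """Fraction of processor slots doing useful work when each output tile
    is assigned whole to one processing element, wave after wave."""
    tiles = ceil(m / tile_m) * ceil(n / tile_n)
    waves = ceil(tiles / num_sms)  # the last wave may be only partly full
    return tiles / (waves * num_sms)


def stream_k_partition(m, n, k, tile_m, tile_n, tile_k, num_sms):
    """Give each processing element an even share of the aggregate
    inner-loop iterations (hypothetical helper, illustration only).

    Returns one list per processing element; each entry is
    (tile_index, first_k_iteration, one_past_last_k_iteration).
    """
    tiles = ceil(m / tile_m) * ceil(n / tile_n)
    iters_per_tile = ceil(k / tile_k)
    total_iters = tiles * iters_per_tile

    schedule = []
    for sm in range(num_sms):
        # Contiguous, near-equal range of the flattened iteration space.
        begin = sm * total_iters // num_sms
        end = (sm + 1) * total_iters // num_sms

        segments = []
        it = begin
        while it < end:
            tile = it // iters_per_tile
            tile_first = tile * iters_per_tile
            seg_end = min(end, tile_first + iters_per_tile)
            # A range may start or stop in the middle of a tile.
            segments.append((tile, it - tile_first, seg_end - tile_first))
            it = seg_end
        schedule.append(segments)
    return schedule


if __name__ == "__main__":
    # 384x384 output with 128x128 tiles gives 9 tiles on 4 processing elements.
    print(tile_based_utilization(384, 384, 128, 128, 4))
    for sm, segs in enumerate(stream_k_partition(384, 384, 128, 128, 128, 32, 4)):
        print(sm, segs)
```

In this example, nine output tiles on four processing elements need three waves under tile-based scheduling, so only 9 of 12 slots do useful work (utilization 0.75). Under the even split, each processing element receives exactly 9 of the 36 inner-loop iterations, at the cost of some tiles being shared between neighbouring processing elements.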