We introduce Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method operates by partitioning an even share of the aggregate inner-loop iterations among physical processing elements. This provides near-perfect utilization of computing resources, regardless of how well the output tiling for any given problem quantizes across the underlying processing elements. On GPU processors, our Stream-K parallelization of GEMM produces peak speedups of up to 14$\times$ and 6.7$\times$, and an average performance response that is both higher and more consistent across 32,824 GEMM problem geometries than that of state-of-the-art math libraries such as CUTLASS and cuBLAS. Furthermore, we achieve this performance from a single tile size configuration per floating-point precision, whereas today's math libraries employ complex kernel-selection heuristics to choose from a large ensemble of kernel variants.
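The work-centric decomposition described above can be sketched in a few lines. The following is a minimal Python model, not the paper's CUDA implementation: it assumes the total work is `num_tiles * iters_per_tile` MAC-loop iterations, assigns each processing element a contiguous, near-equal share of that iteration space, and maps a global iteration index back to its (output tile, local k-iteration) pair; all names here are illustrative.

```python
def stream_k_partition(num_tiles, iters_per_tile, num_pes):
    """Split the aggregate inner-loop iteration space evenly across
    processing elements (e.g., GPU SMs), independent of tile boundaries."""
    total_iters = num_tiles * iters_per_tile
    ranges = []
    for p in range(num_pes):
        # Contiguous [start, end) share; sizes differ by at most one iteration.
        start = (p * total_iters) // num_pes
        end = ((p + 1) * total_iters) // num_pes
        ranges.append((start, end))
    return ranges

def iter_to_tile(global_iter, iters_per_tile):
    """Map a global iteration index to (output tile id, local k-iteration).
    A tile whose iterations straddle two processing elements is finished
    by a separate fix-up step that reduces the partial accumulators."""
    return divmod(global_iter, iters_per_tile)
```

Because shares are cut by iteration count rather than by whole tiles, utilization no longer depends on how the tile count quantizes over the processor count; the cost is an occasional cross-processor reduction for tiles split at a share boundary.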