Large language models (LLMs) show excellent performance but are compute- and memory-intensive. Quantization can reduce memory and accelerate inference. However, for LLMs beyond 100 billion parameters, existing methods cannot maintain accuracy or do not run efficiently on hardware. We propose SmoothQuant, a training-free, accuracy-preserving, and general-purpose post-training quantization (PTQ) solution to enable 8-bit weight, 8-bit activation (W8A8) quantization for LLMs. Based on the fact that weights are easy to quantize while activations are not, SmoothQuant smooths the activation outliers by offline migrating the quantization difficulty from activations to weights with a mathematically equivalent transformation. SmoothQuant enables an INT8 quantization of both weights and activations for all the matrix multiplications in LLMs, including OPT-175B, BLOOM-176B, GLM-130B, and MT-NLG 530B. SmoothQuant has better hardware efficiency than existing techniques. We demonstrate up to 1.56x speedup and 2x memory reduction for LLMs with negligible loss in accuracy. We integrate SmoothQuant into FasterTransformer, a state-of-the-art LLM serving framework, and achieve faster inference speed with half the number of GPUs compared to FP16, enabling the serving of a 530B LLM within a single node. Our work offers a turn-key solution that reduces hardware costs and democratizes LLMs. Code is available at https://github.com/mit-han-lab/smoothquant.
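Below is a minimal sketch of the "mathematically equivalent transformation" the abstract refers to: per-input-channel scales divide the activations and are absorbed into the weights, so outlier activation channels become easier to quantize while the product is unchanged. The migration-strength parameter `alpha`, the max-based calibration statistic `act_max`, and the helper name `smooth_linear` are illustrative assumptions, not the released implementation.

```python
# Sketch: migrate quantization difficulty from activations to weights
# via an equivalent per-channel rescaling (assumed formulation).
import torch

@torch.no_grad()
def smooth_linear(act_max: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    """Return per-input-channel scales s and the smoothed weight.

    act_max: per-channel max of |activations|, shape [in_features],
             assumed to be collected offline on calibration data.
    weight:  linear layer weight, shape [out_features, in_features].
    """
    # s_j = max|X_j|^alpha / max|W_j|^(1-alpha), clamped to avoid zeros.
    w_max = weight.abs().amax(dim=0).clamp(min=1e-5)              # [in_features]
    scales = (act_max.clamp(min=1e-5).pow(alpha)
              / w_max.pow(1.0 - alpha)).clamp(min=1e-5)           # [in_features]
    # Equivalence: (X diag(s)^-1) (diag(s) W^T) == X W^T
    smoothed_weight = weight * scales                             # fold s into W
    return scales, smoothed_weight

if __name__ == "__main__":
    torch.manual_seed(0)
    # Activations with a few outlier channels, as observed in LLMs.
    x = torch.randn(4, 8) * torch.tensor([1., 1., 50., 1., 1., 1., 30., 1.])
    w = torch.randn(16, 8)
    s, w_s = smooth_linear(x.abs().amax(dim=0), w)
    # The smoothed activations X/s have a much flatter dynamic range,
    # while the matmul output is numerically unchanged.
    assert torch.allclose(x @ w.t(), (x / s) @ w_s.t(), atol=1e-3)
```

In practice the division by `s` would not be applied at runtime; it can be folded into the preceding LayerNorm or projection offline, after which ordinary INT8 quantization is applied to both the smoothed activations and weights.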