Recommendation models are very large, requiring terabytes (TB) of memory during training. In pursuit of better quality, model size and complexity grow over time, which in turn requires additional training data to avoid overfitting. This model growth demands substantial data center resources, so training efficiency is becoming increasingly important to keep data center power demand manageable. In Deep Learning Recommendation Models (DLRM), sparse features that capture categorical inputs through embedding tables are the major contributors to model size and require high memory bandwidth. In this paper, we study the bandwidth requirements and locality of embedding tables in real-world deployed models. We observe that bandwidth requirements are not uniform across tables and that embedding tables exhibit high temporal locality. We then design MTrainS, which leverages heterogeneous memory, including byte-addressable and block-addressable Storage Class Memory, hierarchically for DLRM. MTrainS provides higher memory capacity per node and increases training efficiency by reducing the need to scale out to multiple hosts in memory-capacity-bound use cases. By optimizing the platform memory hierarchy, we reduce the number of training nodes by 4-8X, saving power and cost while meeting our target training performance.
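To make the idea of bandwidth-aware table placement concrete, the following is a minimal illustrative sketch, not the paper's actual implementation: it counts per-table embedding lookups from a hypothetical access trace, ranks tables by estimated bandwidth demand, and assigns the hottest tables to the fastest memory tier. The trace, embedding dimension, tier names, and tier capacities are all assumed placeholders.

```python
# Illustrative sketch (not MTrainS itself): place embedding tables across a
# heterogeneous memory hierarchy based on observed access frequency.
from collections import Counter

# Hypothetical lookup trace: each entry is (table_id, row_id).
trace = [(0, 5), (0, 7), (1, 3), (0, 5), (2, 9), (0, 7), (1, 3), (0, 5)]

# Bytes moved per lookup, assuming fp32 embeddings of dimension EMB_DIM.
EMB_DIM = 128
BYTES_PER_LOOKUP = EMB_DIM * 4

# Count accesses per table to estimate relative bandwidth demand.
accesses = Counter(table_id for table_id, _ in trace)

# Hypothetical tiers, fastest to slowest, with per-tier table capacities.
tiers = [("DRAM", 1), ("byte-addressable SCM", 1), ("block-addressable SCM", 10)]

# Greedily map the hottest tables to the fastest tiers.
placement = {}
ranked = iter(table for table, _ in accesses.most_common())
for tier_name, capacity in tiers:
    for _ in range(capacity):
        table = next(ranked, None)
        if table is None:
            break
        placement[table] = tier_name

for table in sorted(accesses):
    bw = accesses[table] * BYTES_PER_LOOKUP
    print(f"table {table}: {accesses[table]} lookups (~{bw} B) -> {placement[table]}")
```

In this toy example the most frequently accessed table lands in DRAM while colder tables spill to the slower Storage Class Memory tiers, which mirrors the abstract's observation that bandwidth demand is skewed across tables and can be exploited by a memory hierarchy.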