Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems for DL model training. While significant effort has been put into facilitating distributed training in DL frameworks, fault tolerance has been largely ignored. In this work, we evaluate checkpoint-restart, a common fault tolerance technique in HPC workloads. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation reveals critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC, as well as takeaway points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.