As deep learning models and input data scale at an unprecedented rate, moving to distributed training platforms has become essential to fit the model and increase training throughput. State-of-the-art approaches and techniques, such as wafer-scale nodes, multi-dimensional network topologies, disaggregated memory systems, and parallelization strategies, have been actively adopted by emerging distributed training systems. This results in a complex SW/HW co-design stack for distributed training, necessitating a modeling and simulation infrastructure for design-space exploration. In this paper, we extend the open-source ASTRA-sim infrastructure with the capability to model state-of-the-art and emerging distributed training models and platforms. More specifically, (i) we enable ASTRA-sim to support arbitrary model parallelization strategies via a graph-based training-loop implementation, (ii) we implement a parameterizable multi-dimensional heterogeneous topology generation infrastructure with analytical performance estimates, enabling simulation of target systems at scale, and (iii) we enhance the memory system modeling to accurately capture in-network collective communication and disaggregated memory systems. With these capabilities, we run comprehensive case studies targeting emerging distributed models and platforms. This infrastructure lets system designers swiftly traverse the complex co-design stack and gain meaningful insights when designing and deploying distributed training platforms at scale.