Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures - Zhuanzhi Papers


December 13, 2020


Chenhao Xie, Jieyang Chen, Jesun S Firoz, Jiajia Li, Shuaiwen Leon Song, Kevin Barker, Mark Raugas, Ang Li

Designing efficient and scalable sparse linear algebra kernels on modern multi-GPU based HPC systems is a daunting task due to significant irregular memory references and workload imbalance across the GPUs. This is particularly the case for the Sparse Triangular Solver (SpTRSV), which introduces additional two-dimensional computation dependencies among subsequent computation steps. Dependency information is exchanged and shared among GPUs, which calls for efficient memory allocation, data partitioning, and workload distribution, as well as fine-grained communication and synchronization support. In this work, we demonstrate that directly adopting unified memory can adversely affect the performance of SpTRSV on multi-GPU architectures, even when the GPUs are linked by fast interconnects such as NVLink and NVSwitch. As an alternative, we employ the latest NVSHMEM technology, based on the Partitioned Global Address Space programming model, to enable efficient fine-grained communication and a drastic reduction in synchronization overhead. Furthermore, to handle workload imbalance, we propose a malleable task-pool execution model which can further enhance the utilization of GPUs. By applying these techniques, our experiments on the NVIDIA multi-GPU supernode V100-DGX-1 and DGX-2 systems demonstrate that our design can achieve on average 3.53x (up to 9.86x) speedup on a DGX-1 system and 3.66x (up to 9.64x) speedup on a DGX-2 system with 4 GPUs over the Unified-Memory design. The comprehensive sensitivity and scalability studies also show that the proposed zero-copy SpTRSV is able to fully utilize the computing and communication resources of the multi-GPU system.
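To make the dependency problem in SpTRSV concrete, here is a minimal single-threaded Python sketch of solving L x = b for a lower-triangular matrix stored in CSR form. It is not the paper's multi-GPU design: it involves no NVSHMEM, unified memory, or task pool. The function names (`level_sets`, `sptrsv_lower`), the CSR layout, and the use of level-set grouping are assumptions chosen for illustration. The sketch only shows that a row cannot be solved until every earlier component it references is known, which is the dependency information that would have to be exchanged between GPUs.

```python
from typing import List, Sequence


def level_sets(row_ptr: Sequence[int], col_idx: Sequence[int]) -> List[List[int]]:
    """Group rows of a lower-triangular CSR matrix into dependency levels.

    Row i depends on row j (j < i) whenever L[i, j] is stored.
    Rows in the same level do not depend on one another.
    """
    n = len(row_ptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:  # off-diagonal entry: row i must wait for x[j]
                level[i] = max(level[i], level[j] + 1)
    groups: List[List[int]] = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        groups[lv].append(i)
    return groups


def sptrsv_lower(row_ptr: Sequence[int], col_idx: Sequence[int],
                 values: Sequence[float], b: Sequence[float]) -> List[float]:
    """Solve L x = b by forward substitution, one dependency level at a time."""
    n = len(row_ptr) - 1
    x = [0.0] * n
    for rows in level_sets(row_ptr, col_idx):
        # Rows within one level are independent, so a parallel version
        # could process them concurrently; here they run serially.
        for i in rows:
            acc = b[i]
            diag = None
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                if j == i:
                    diag = values[k]
                else:
                    acc -= values[k] * x[j]
            if diag is None or diag == 0:
                raise ValueError(f"row {i} has no nonzero diagonal entry")
            x[i] = acc / diag
    return x


# Hypothetical 4x4 example:
# L = [[2, 0, 0, 0],
#      [0, 4, 0, 0],
#      [1, 0, 1, 0],
#      [0, 2, 3, 5]]
row_ptr = [0, 1, 2, 4, 7]
col_idx = [0, 1, 0, 2, 1, 2, 3]
values = [2.0, 4.0, 1.0, 1.0, 2.0, 3.0, 5.0]
b = [2.0, 8.0, 4.0, 33.0]

groups = level_sets(row_ptr, col_idx)
x = sptrsv_lower(row_ptr, col_idx, values, b)
```

In this example rows 0 and 1 are independent, row 2 waits on row 0, and row 3 waits on rows 1 and 2, so the levels have uneven sizes. That unevenness is a small-scale picture of the workload imbalance the abstract mentions.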

