Performance Assessment of OpenMP Compilers Targeting NVIDIA V100 GPUs

December 2, 2020

Joshua Hoke Davis, Christopher Daley, Swaroop Pophale, Thomas Huber, Sunita Chandrasekaran, Nicholas J. Wright

From arXiv: 20 pages, 7 figures; accepted at WACCPD 2020 at SC20 (under publication with Springer)

Heterogeneous systems are becoming increasingly prevalent. In order to exploit the rich compute resources of such systems, robust programming models are needed for application developers to seamlessly migrate legacy code from today's systems to tomorrow's. Over the past decade and more, directives have been established as one of the promising paths to tackle programmatic challenges on emerging systems. This work focuses on applying and demonstrating OpenMP offloading directives on five proxy applications. We observe that the performance varies widely from one compiler to the other; a crucial aspect of our work is reporting best practices to application developers who use OpenMP offloading compilers. While some issues can be worked around by the developer, there are other issues that must be reported to the compiler vendors. By restructuring OpenMP offloading directives, we gain an 18x speedup for the su3 proxy application on NERSC's Cori system when using the Clang compiler, and a 15.7x speedup by switching max reductions to add reductions in the laplace mini-app when using the Cray-llvm compiler on Cori.
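
As a rough illustration of the last point, the sketch below contrasts a max reduction with an add reduction over the same per-cell change in a Jacobi-style Laplace sweep. It is not taken from the paper: the function names `sweep_max` and `sweep_add`, the `n` parameter, the data-mapping clauses, and the choice of summing absolute differences are all assumptions made for illustration. The add variant computes a different (L1-style) error measure, so a convergence threshold tuned for the max form would need to be adjusted, and whether either form is faster depends on the compiler.

```c
#include <assert.h>

/* One Jacobi sweep over an n x n grid (hypothetical helper).
 * Returns the largest per-cell change, using a max reduction. */
double sweep_max(const double *in, double *out, int n)
{
    assert(n >= 3);
    double err = 0.0;
    #pragma omp target teams distribute parallel for collapse(2) \
        reduction(max: err) map(to: in[0:n*n]) map(tofrom: out[0:n*n], err)
    for (int i = 1; i < n - 1; i++) {
        for (int j = 1; j < n - 1; j++) {
            double v = 0.25 * (in[(i - 1) * n + j] + in[(i + 1) * n + j]
                             + in[i * n + j - 1] + in[i * n + j + 1]);
            double d = v - in[i * n + j];
            if (d < 0.0) d = -d;       /* absolute change in this cell */
            out[i * n + j] = v;
            if (d > err) err = d;      /* max reduction */
        }
    }
    return err;
}

/* Same sweep, but the per-cell changes are summed (add reduction).
 * Assumption: a summed error is an acceptable convergence measure. */
double sweep_add(const double *in, double *out, int n)
{
    assert(n >= 3);
    double err = 0.0;
    #pragma omp target teams distribute parallel for collapse(2) \
        reduction(+: err) map(to: in[0:n*n]) map(tofrom: out[0:n*n], err)
    for (int i = 1; i < n - 1; i++) {
        for (int j = 1; j < n - 1; j++) {
            double v = 0.25 * (in[(i - 1) * n + j] + in[(i + 1) * n + j]
                             + in[i * n + j - 1] + in[i * n + j + 1]);
            double d = v - in[i * n + j];
            if (d < 0.0) d = -d;       /* absolute change in this cell */
            out[i * n + j] = v;
            err += d;                  /* add reduction */
        }
    }
    return err;
}
```

Both functions also compile and run on the host when OpenMP offloading is disabled, since the directives are then ignored; that makes it straightforward to time the two variants under each compiler of interest.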

