Portability · Scalability · Extensibility · Porting · GNU

April 9, 2023

Portability and Scalability of OpenMP Offloading on State-of-the-art Accelerators


Yehonatan Fridman, Guy Tamir, Gal Oren

From arXiv, 13 pages

Over the last decade, most of the increase in computing power has come from advances in accelerated many-core architectures, mainly in the form of GPGPUs. While accelerators achieve phenomenal performance in various computing tasks, their utilization requires code adaptations and transformations. Thus, OpenMP, the most common standard for multi-threading in scientific computing applications, has offered offloading capabilities between the host (CPUs) and accelerators since v4.0, with increasing support in the successive v4.5, v5.0, v5.1, and the latest v5.2 versions. Recently, two state-of-the-art GPUs - the Intel Ponte Vecchio Max 1100 and the NVIDIA A100 GPUs - were released to the market, with the oneAPI and GNU LLVM-backed compilation for offloading, respectively. In this work, we present early performance results of OpenMP offloading capabilities to these devices while specifically analyzing the portability of advanced directives (using SOLLVE's OMPVV test suite) and the scalability of the hardware in a representative scientific mini-app (the LULESH benchmark). Our results show that the vast majority of the offloading directives in v4.5 and v5.0 are supported in the latest oneAPI and GNU compilers; however, the support for v5.1 and v5.2 is still lacking. From the performance perspective, we found that PVC is up to 37% better than the A100 on the LULESH benchmark, presenting better performance in computing and data movements.
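For readers unfamiliar with the offloading directives the abstract refers to, the following is a minimal sketch of an OpenMP 4.5-style offloaded loop. It is not taken from the paper, the OMPVV suite, or LULESH; the function name `saxpy_offload` and its parameters are hypothetical and chosen only for illustration. It shows a `target` region with `map` clauses that move data between the host and the accelerator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: computes y[i] = a * x[i] + y[i] for i in [0, n).
 * With an offloading-capable compiler the loop runs on the default device;
 * if no device is available, or OpenMP is disabled, it runs on the host. */
static void saxpy_offload(int n, float a, const float *x, float *y)
{
    /* Host-side precondition check before entering the target region. */
    assert(n >= 0 && x != NULL && y != NULL);

    /* map(to:) copies x to the device; map(tofrom:) copies y to the device
     * and back to the host when the region ends. */
    #pragma omp target teams distribute parallel for \
        map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The same source compiles without OpenMP support, in which case the pragma is ignored and the loop runs sequentially on the host. The flags needed to enable offloading depend on the compiler and target device and are not shown here.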

