Bandwidth-Optimal Random Shuffling for GPUs - Zhuanzhi (专知) Papers

June 11, 2021

Bandwidth-Optimal Random Shuffling for GPUs

Rory Mitchell, Daniel Stokes, Eibe Frank, Geoffrey Holmes

Linear-time algorithms that are traditionally used to shuffle data on CPUs, such as the Fisher-Yates method, are not well suited to implementation on GPUs due to inherent sequential dependencies. Moreover, existing parallel shuffling algorithms show unsatisfactory performance on GPU architectures because they incur a large number of read/write operations to high-latency global memory. To address this, we provide a method of generating pseudo-random permutations in parallel by fusing suitable pseudo-random bijective functions with stream compaction operations. Our algorithm, termed "bijective shuffle", trades increased per-thread arithmetic operations for reduced global memory transactions. It is work-efficient, deterministic, and requires only a single global memory read and write per shuffle input, thus maximising use of global memory bandwidth. To empirically demonstrate the correctness of the algorithm, we develop a consistent, linear-time statistical test for the quality of pseudo-random permutations based on kernel space embeddings. Empirical results show that the bijective shuffle algorithm outperforms competing algorithms on multicore CPUs and GPUs, showing improvements of between one and two orders of magnitude and approaching peak device bandwidth.
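The idea of fusing a bijection with stream compaction can be illustrated with a small sequential sketch in Python. This is not the authors' implementation: the abstract does not specify the bijective function, so the multiply-add-xorshift mixer below (`make_bijection`) is a stand-in chosen only because each step is invertible on a power-of-two domain, and all function names are hypothetical. The sketch pads the input length `n` up to a power of two `m`, maps every index in `[0, m)` through the bijection, and compacts away the values that fall outside `[0, n)`. The surviving values, in order, form a permutation of `[0, n)`.

```python
import random
from itertools import accumulate


def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (returns 1 for n <= 1)."""
    m = 1
    while m < n:
        m <<= 1
    return m


def make_bijection(m: int, seed: int, rounds: int = 3):
    """Build a keyed bijection on [0, m), where m is a power of two.

    Assumption: this mixer is a stand-in, not the function used in the paper.
    Each round composes steps that are individually invertible modulo m:
    multiplication by an odd constant, addition, and a right xorshift.
    """
    rng = random.Random(seed)
    mask = m - 1
    bits = max(1, m.bit_length() - 1)
    params = [
        (rng.randrange(m) | 1, rng.randrange(m), rng.randrange(1, bits + 1))
        for _ in range(rounds)
    ]

    def bijection(x: int) -> int:
        for mul, add, shift in params:
            x = (x * mul + add) & mask  # odd multiplier => invertible mod 2^k
            x ^= x >> shift             # right xorshift is invertible
        return x

    return bijection


def bijective_shuffle(items, seed: int):
    """Shuffle `items` deterministically for a given seed."""
    n = len(items)
    m = next_power_of_two(n)
    b = make_bijection(m, seed)

    # Each index is processed independently; on a GPU this would be one
    # thread per index, with no dependency between iterations.
    targets = [b(i) for i in range(m)]

    # Stream compaction: flag in-range targets, then an exclusive prefix
    # sum gives each surviving element its output slot.
    flags = [1 if t < n else 0 for t in targets]
    offsets = [0] + list(accumulate(flags))[:-1]

    out = [None] * n
    for i in range(m):
        if flags[i]:
            # One read from `items` and one write to `out` per element.
            out[offsets[i]] = items[targets[i]]
    return out


if __name__ == "__main__":
    print(bijective_shuffle(list("abcdefghij"), seed=42))
```

Because the bijection visits every value in `[0, m)` exactly once, every value in `[0, n)` survives compaction exactly once, so the output is always a permutation of the input. How close to uniformly random that permutation is depends entirely on the quality of the bijection, which this stand-in mixer does not attempt to guarantee.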
