Accelerators such as GPUs have become a mainstream approach to delivering future performance gains, and sharing a unified virtual memory space between CPUs and GPUs is increasingly adopted to simplify programming. However, address translation, the cornerstone of virtual memory, is becoming a performance bottleneck for GPUs: a single TLB miss can stall hundreds of threads due to the SIMT execution model, degrading performance dramatically. Through real-system analysis, we observe that the OS exhibits substantial contiguity in its allocations (e.g., hundreds of contiguous pages), and that large memory regions with such contiguity become more common as working sets grow. Leveraging this observation, we propose MESC to improve translation efficiency for GPUs. The key idea of MESC is to divide each large page frame (2MB) in the virtual memory space into fixed-size memory subregions (i.e., 64 4KB pages each) and to store the contiguity information of subregions and large page frames in L2PTEs. With MESC, address translations for up to 512 pages can be coalesced into a single TLB entry, without changing the memory allocation policy (i.e., demand paging) and without requiring large-page support. In our experiments, MESC achieves a 77.2% performance improvement and a 76.4% reduction in dynamic translation energy for translation-sensitive workloads.
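The coalescing mechanism described above can be illustrated with a small sketch. This is a toy software model, not the paper's hardware design: the `TLB`, `walk`, and contiguity-dictionary names are illustrative assumptions. It shows how, when the contiguity bits (assumed to be read from the L2PTE during the walk) mark a 2MB frame or a 64-page subregion as contiguous, one TLB entry can serve up to 512 translations.

```python
# Toy model of MESC-style coalesced address translation (a sketch under
# stated assumptions; structure names are hypothetical, not from the paper).

PAGES_PER_SUBREGION = 64
SUBREGIONS_PER_FRAME = 8            # 8 * 64 * 4KB = one 2MB large page frame
PAGES_PER_FRAME = PAGES_PER_SUBREGION * SUBREGIONS_PER_FRAME  # 512 pages

class TLB:
    """Each entry covers a contiguous run of `span` virtual pages with one tag."""
    def __init__(self):
        self.entries = []           # list of (base_vpn, base_pfn, span)
        self.misses = 0

    def lookup(self, vpn):
        for base_vpn, base_pfn, span in self.entries:
            if base_vpn <= vpn < base_vpn + span:
                return base_pfn + (vpn - base_vpn)
        return None

def walk(vpn, page_table, frame_contig, subregion_contig):
    """Page-table walk that also consults the contiguity bits assumed to be
    stored in the L2PTE, returning the widest coalesced mapping available."""
    frame = vpn // PAGES_PER_FRAME
    if frame_contig.get(frame):                   # whole 2MB frame contiguous
        base_vpn = frame * PAGES_PER_FRAME
        return base_vpn, page_table[base_vpn], PAGES_PER_FRAME
    sub = vpn // PAGES_PER_SUBREGION
    if subregion_contig.get(sub):                 # 64-page subregion contiguous
        base_vpn = sub * PAGES_PER_SUBREGION
        return base_vpn, page_table[base_vpn], PAGES_PER_SUBREGION
    return vpn, page_table[vpn], 1                # fall back to a 4KB entry

def translate(vpn, tlb, page_table, frame_contig, subregion_contig):
    pfn = tlb.lookup(vpn)
    if pfn is not None:
        return pfn                                # TLB hit
    tlb.misses += 1
    base_vpn, base_pfn, span = walk(vpn, page_table,
                                    frame_contig, subregion_contig)
    tlb.entries.append((base_vpn, base_pfn, span))
    return base_pfn + (vpn - base_vpn)

# Usage: one fully contiguous 2MB frame -> 512 translations, a single miss.
page_table = {vpn: 1000 + vpn for vpn in range(PAGES_PER_FRAME)}
tlb = TLB()
results = [translate(v, tlb, page_table, {0: True}, {})
           for v in range(PAGES_PER_FRAME)]
print(tlb.misses)        # 1: all 512 pages served by one coalesced entry
```

Because demand paging is untouched, a frame whose pages were not allocated contiguously simply takes the 4KB fallback path; coalescing is opportunistic, driven only by the contiguity the OS already produced.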