Shared L1 memory clusters are a common architectural pattern (e.g., in GPGPUs) for building efficient and flexible multi-processing-element (PE) engines. However, it is a common belief that these tightly-coupled clusters would not scale beyond a few tens of PEs. In this work, we tackle scaling shared L1 clusters to hundreds of PEs while supporting a flexible and productive programming model and maintaining high efficiency. We present MemPool, a manycore system with 256 RV32IMAXpulpimg "Snitch" cores featuring application-tunable functional units. We designed and implemented an efficient low-latency PE to L1-memory interconnect, an optimized instruction path to ensure each PE's independent execution, and a powerful DMA engine and system interconnect to stream data in and out. MemPool is easy to program, with all the cores sharing a global view of a large, multi-banked, L1 scratchpad memory, accessible within at most five cycles in the absence of conflicts. We provide multiple runtimes to program MemPool at different abstraction levels and illustrate its versatility with a wide set of applications. MemPool runs at 600 MHz (60 gate delays) in typical conditions (TT/0.80 V/25 °C) in 22 nm FDX technology and achieves a performance of up to 229 GOPS or 192 GOPS/W with less than 2% of execution stalls.
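To illustrate the shared-L1 programming model described above, the following is a minimal, hypothetical bare-metal C sketch of a data-parallel kernel on MemPool. The runtime helpers mempool_get_core_id(), mempool_get_core_count(), and mempool_barrier() are assumed names for illustration only and are not confirmed APIs; the point is that every core directly addresses the same multi-banked L1 scratchpad, so work can be split by index striding without explicit data movement.

```c
#include <stdint.h>

// Assumed runtime interface (hypothetical prototypes for illustration).
extern uint32_t mempool_get_core_id(void);
extern uint32_t mempool_get_core_count(void);
extern void mempool_barrier(uint32_t num_cores);

#define N 1024

// Arrays reside in the shared, multi-banked L1 scratchpad visible to all cores.
int32_t a[N], b[N], c[N];

void vec_add_parallel(void) {
  uint32_t core_id = mempool_get_core_id();      // this core's index (assumed helper)
  uint32_t num_cores = mempool_get_core_count(); // total PEs, e.g. 256 (assumed helper)

  // Each core processes an interleaved slice of the shared arrays;
  // the global L1 view means no per-core copies or DMA are needed here.
  for (uint32_t i = core_id; i < N; i += num_cores) {
    c[i] = a[i] + b[i];
  }

  // Synchronize before any core consumes the complete result.
  mempool_barrier(num_cores); // assumed helper
}
```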