FLAT: 用于缓解注意力瓶颈的优化数据流 (FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks)

Attention mechanisms, primarily designed to capture pairwise correlations between words, have become the backbone of machine learning, expanding beyond natural language processing into other domains. This growth in adaptation comes at the cost of prohibitively large memory requirements and computational complexity, especially at higher number of input elements. This limitation is due to inherently limited data reuse opportunities and quadratic growth in memory footprints, leading to severe memory-boundedness and limited scalability of input elements. This work addresses these challenges by devising a tailored dataflow optimization, called FLAT, for attention mechanisms without altering their functionality. This dataflow processes costly attention operations through a unique fusion mechanism, transforming the memory footprint quadratic growth to merely a linear one. To realize the full potential of this bespoke mechanism, we propose a tiling approach to enhance the data reuse across attention operations. Our method both mitigates the off-chip bandwidth bottleneck as well as reduces the on-chip memory requirement. Across a diverse range of models, FLAT delivers 1.94x (1.76x) speedup and 49% and (42%) of energy savings compared to the state-of-the-art edge (cloud) accelerators with no customized dataflow optimization. Our evaluations demonstrate that state-of-the-art DNN dataflows applied to attention operations reach the efficiency limit for inputs above 512 elements. In contrast, FLAT unblocks transformer models for inputs with up to 64 K elements in edge and cloud accelerators.

翻译：关注机制主要是为了捕捉言词之间的对等关系,它已经成为机器学习的主干,超越自然语言处理,扩展到其他领域。适应的增加是以惊人的庞大记忆要求和计算复杂性的代价,特别是投入元素数量较多的代价。这一限制是由于数据再利用机会内在有限,记忆足迹的二次增长,导致严重的内存限制和输入元素的可缩缩放性。这项工作通过设计一个定制的数据流优化(称为FLAT),在不改变功能的情况下用于关注边缘机制,来应对这些挑战。这一数据流通过一个独特的聚合机制处理昂贵的注意力操作,将记忆足部二次增长转变为仅仅是直线性机制。为了实现这一表达机制的全部潜力,我们建议采用一种平铺式方法,以加强注意力作业之间的数据再利用。我们的方法既可以减轻机外带带宽度的带宽度,又可以减少输入元素在芯片上的记忆要求。在多种模型中,FLAT提供1.94x(1.76x)加速和49 %以及(42%)节能节约操作,而比Sn-Artime 平面的模型,可以显示我们Flock-tradeal 数据流到Flal 5 数据流。

相关内容

注意力机制

关注 120

Attention机制最早是在视觉图像领域提出来的，但是真正火起来应该算是google mind团队的这篇论文《Recurrent Models of Visual Attention》[14]，他们在RNN模型上使用了attention机制来进行图像分类。随后，Bahdanau等人在论文《Neural Machine Translation by Jointly Learning to Align and Translate》 [1]中，使用类似attention的机制在机器翻译任务上将翻译和对齐同时进行，他们的工作算是是第一个提出attention机制应用到NLP领域中。接着类似的基于attention机制的RNN模型扩展开始应用到各种NLP任务中。最近，如何在CNN中使用attention机制也成为了大家的研究热点。下图表示了attention研究进展的大概趋势。

ICLR 2022杰出论文公布：7篇论文获得，清华朱军课题组摘得

专知会员服务

60+阅读 · 2022年4月22日

高效可扩展图神经网络的研究进展，Recent Advances in Efficient and Scalable Graph Neural Networks

专知会员服务

78+阅读 · 2022年3月15日

ICLR 2021杰出论文奖出炉，8篇论文上榜！

专知会员服务

26+阅读 · 2021年4月2日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

49+阅读 · 2019年10月17日