Attention mechanisms, primarily designed to capture pairwise correlations between words, have become the backbone of machine learning, expanding beyond natural language processing into other domains. This growing adoption comes at the cost of prohibitively large memory requirements and computational complexity, especially for longer input sequences. These costs stem from inherently limited data-reuse opportunities and quadratic growth in memory footprint, leading to severe memory-boundedness and poor scalability with input length. This work addresses these challenges by devising a tailored dataflow optimization, called FLAT, for attention mechanisms, without altering their functionality. FLAT processes costly attention operations through a unique fusion mechanism, transforming the quadratic growth in memory footprint into a merely linear one. To realize the full potential of this bespoke mechanism, we propose a tiling approach that enhances data reuse across attention operations. Our method both mitigates the off-chip bandwidth bottleneck and reduces the on-chip memory requirement. FLAT delivers 1.94x (1.76x) speedup and 49% (42%) energy savings compared to state-of-the-art Edge (Cloud) accelerators with no customized dataflow optimization. When on-chip resources are scarce (20 KB-200 KB), FLAT yields, on average, a 1.5x end-to-end latency reduction across a diverse range of conventional attention-based models with input sequence lengths ranging from 512 to 64K tokens. Our evaluations demonstrate that state-of-the-art DNN dataflows applied to attention operations reach their efficiency limit for inputs longer than 512 elements. In contrast, FLAT unblocks transformer models for inputs of up to 64K elements.
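The core fusion idea — avoiding materialization of the full quadratic attention-score matrix by processing keys and values in tiles with a running (online) softmax — can be illustrated with a minimal sketch. This is an illustrative online-softmax formulation under our own assumptions, not FLAT's exact dataflow or scheduling; the function name and tile size are placeholders.

```python
import numpy as np

def fused_tiled_attention(Q, K, V, tile=64):
    """Illustrative fused attention over K/V tiles.

    A running row-wise max `m` and denominator `l` let us apply softmax
    incrementally, so only an (n x tile) score block ever exists in
    memory -- linear, rather than quadratic, footprint in sequence length.
    """
    n, d = Q.shape
    out = np.zeros_like(V, dtype=np.float64)
    m = np.full(n, -np.inf)          # running row-wise max (numerical stability)
    l = np.zeros(n)                  # running softmax denominator
    for s in range(0, n, tile):
        Kt, Vt = K[s:s + tile], V[s:s + tile]
        scores = Q @ Kt.T / np.sqrt(d)          # only an (n x tile) block
        m_new = np.maximum(m, scores.max(axis=1))
        correction = np.exp(m - m_new)          # rescale previous partial sums
        p = np.exp(scores - m_new[:, None])
        l = l * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vt
        m = m_new
    return out / l[:, None]
```

Because each tile's partial results are rescaled and accumulated on the fly, the output matches standard softmax attention exactly, while the working set per step depends on the tile size rather than on the full sequence length.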