Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware -- accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 15% end-to-end wall-clock speedup on BERT-large (seq. length 512) compared to the MLPerf 1.1 training speed record, 3$\times$ speedup on GPT-2 (seq. length 1K), and 2.4$\times$ speedup on long-range arena (seq. length 1K-4K). FlashAttention and block-sparse FlashAttention enable longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
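As a rough illustration of the tiling idea referenced above, the following minimal NumPy sketch computes exact attention one query/key block at a time using an online softmax, so the full N×N score matrix is never materialized. The function name, block size, and pure-NumPy setting are illustrative assumptions, not the paper's implementation; the actual FlashAttention algorithm runs as a fused GPU kernel that keeps these tiles in on-chip SRAM and streams them from HBM.

```python
import numpy as np

def tiled_attention_sketch(Q, K, V, block_size=64):
    """Illustrative tiled exact attention with an online softmax.

    Q, K, V: arrays of shape (N, d). Returns softmax(Q K^T / sqrt(d)) V,
    computed block by block so only O(block_size) rows of scores exist
    at any time. This is a didactic sketch, not a GPU kernel.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    for i in range(0, N, block_size):
        Qi = Q[i:i + block_size]                     # query tile
        mi = np.full(Qi.shape[0], -np.inf)           # running row maxima
        li = np.zeros(Qi.shape[0])                   # running softmax denominators
        Oi = np.zeros_like(Qi)                       # running (unnormalized) output
        for j in range(0, N, block_size):
            Kj, Vj = K[j:j + block_size], V[j:j + block_size]
            Sij = (Qi @ Kj.T) * scale                # scores for this tile only
            m_new = np.maximum(mi, Sij.max(axis=1))  # updated row maxima
            Pij = np.exp(Sij - m_new[:, None])       # numerically stable exponentials
            alpha = np.exp(mi - m_new)               # rescale contributions of past tiles
            li = alpha * li + Pij.sum(axis=1)
            Oi = alpha[:, None] * Oi + Pij @ Vj
            mi = m_new
        O[i:i + block_size] = Oi / li[:, None]       # final softmax normalization
    return O
```

On random inputs this matches a reference softmax(QK^T/sqrt(d))V computed with the full score matrix up to floating-point error; the point of the blocking is that only one tile of scores is live at a time, which is what lets the real kernel keep its working set in SRAM.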