Transformers have become the mainstream architecture for NLP applications and are increasingly popular in other domains such as computer vision. Despite the improvements in model quality, their enormous computation cost makes Transformers difficult to deploy, especially when the sequence length is large in emerging applications. The attention mechanism, the essential component of the Transformer, is the execution bottleneck due to its quadratic complexity in sequence length. Prior work exploits sparse patterns in attention to support long-sequence modeling, but relies on static or fixed patterns. We demonstrate that the sparse patterns are dynamic and depend on the input sequence. Thus, we propose Dynamic Sparse Attention (DSA), which efficiently exploits the dynamic sparsity in Transformer attention. Compared with other methods, our approach achieves better trade-offs between accuracy and model complexity. Moving forward, we identify challenges and provide solutions for implementing DSA on existing hardware (GPUs) and on specialized hardware in order to achieve practical speedup and efficiency improvements for Transformer execution.
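To make the quadratic bottleneck and the notion of input-dependent sparsity concrete, the following is a minimal NumPy sketch, not the paper's DSA algorithm: dense attention materializes an N x N score matrix, while a hypothetical dynamic mask (here a simple score threshold, an assumption for illustration) keeps only the entries that matter for the given input.

```python
import numpy as np

def dense_attention(Q, K, V):
    # Full attention: the N x N score matrix is the source of the quadratic cost.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # shape (N, N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def dynamic_sparse_attention(Q, K, V, threshold=2.0):
    # Illustrative only (not the paper's method): keep score entries above a
    # threshold, so the sparsity pattern changes with every input sequence.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # shape (N, N)
    mask = scores >= threshold                          # dynamic, data-dependent
    mask |= np.eye(Q.shape[0], dtype=bool)              # always keep the diagonal
    masked = np.where(mask, scores, -np.inf)
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, mask

# Usage: for sequence length N, dense attention touches all N*N score entries;
# the dynamic mask retains only a small, input-dependent fraction of them.
N, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((N, d)) for _ in range(3))
out, mask = dynamic_sparse_attention(Q, K, V)
print(f"kept {mask.mean():.1%} of the N*N score entries")
```

The sketch only illustrates why a fixed pattern cannot capture this sparsity: the retained positions in `mask` differ for every Q and K, which is the property DSA is designed to exploit efficiently in hardware.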