The attention mechanism of transformers effectively extracts pertinent information from the input sequence. However, the quadratic complexity of self-attention with respect to sequence length incurs heavy computational and memory burdens, especially for tasks involving long sequences, and existing accelerators suffer performance degradation on such workloads. To this end, we propose SALO, which enables hybrid sparse attention mechanisms for long sequences. SALO comprises a data scheduler that maps hybrid sparse attention patterns onto hardware and a spatial accelerator that performs the attention computation efficiently. We show that SALO achieves 17.66x and 89.33x average speedup over GPU and CPU implementations, respectively, on typical workloads, i.e., Longformer and ViL.
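To make the sparsity argument concrete, below is a minimal NumPy sketch of the kind of hybrid sparse attention pattern (a sliding window plus a few global tokens, as in Longformer) that motivates SALO; the window size and global-token positions are illustrative assumptions, not SALO's actual configuration.

```python
import numpy as np

def hybrid_sparse_mask(seq_len, window, global_idx):
    """Boolean attention mask combining a sliding window with global tokens.

    Illustrative only: the window size and global-token positions are
    arbitrary choices for this sketch, not SALO's configuration.
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    idx = np.arange(seq_len)
    # Sliding window: each query attends to keys within +/- `window` positions.
    mask[np.abs(idx[:, None] - idx[None, :]) <= window] = True
    # Global tokens attend to all positions and are attended to by all positions.
    mask[global_idx, :] = True
    mask[:, global_idx] = True
    return mask

n, w = 4096, 128
m = hybrid_sparse_mask(n, w, global_idx=[0])
print(f"dense:  {n * n:,} score entries")        # 16,777,216
print(f"sparse: {int(m.sum()):,} score entries")  # roughly 1M, ~6% of dense
```

The mask only counts which query-key scores need to be computed; the point is that the sparse pattern grows roughly linearly with sequence length, whereas dense attention grows quadratically.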