Attention mechanisms form the backbone of state-of-the-art machine learning models for a variety of tasks. However, as this work identifies, deploying them on deep neural network (DNN) accelerators is prohibitively challenging, especially for long sequences. This is because operators in attention layers exhibit limited reuse opportunities and a memory footprint that grows quadratically with sequence length, leading to severe memory-boundedness. To address this, we introduce a new attention-tailored dataflow, termed FLAT, which identifies fusion opportunities within the attention layer and implements an on-chip memory-aware interleaved execution and tiling mechanism. FLAT increases the effective memory bandwidth by efficiently utilizing the high-bandwidth, low-capacity on-chip buffer, and thus achieves better run time and compute resource utilization. In our evaluation, FLAT achieves 1.94x and 1.76x speedup and 49% and 42% energy reduction compared to baseline execution on state-of-the-art edge and cloud accelerators, respectively.
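To make the fusion-and-tiling idea concrete, the sketch below is a minimal NumPy illustration (not the paper's dataflow or code) of processing queries in row tiles of a hypothetical size `Tr`, so that only a (Tr, L) slice of the attention score matrix is ever live at once instead of the full (L, L) matrix, which mirrors the quadratic-memory-footprint argument above.

```python
# Illustrative sketch only: row-tiled, fused attention in NumPy.
# Tile size `Tr` is a hypothetical knob standing in for on-chip buffer capacity.
import numpy as np

def tiled_attention(Q, K, V, Tr=128):
    """Q, K, V: (L, d) arrays. Queries are processed in row tiles of size Tr,
    so the live intermediate score tile is (Tr, L) rather than (L, L)."""
    L, d = Q.shape
    out = np.empty_like(Q)
    for start in range(0, L, Tr):
        q_tile = Q[start:start + Tr]                   # (Tr, d) query tile
        scores = q_tile @ K.T / np.sqrt(d)             # (Tr, L) partial scores
        scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
        probs = np.exp(scores)
        probs /= probs.sum(axis=-1, keepdims=True)
        out[start:start + Tr] = probs @ V              # fuse softmax with V matmul
    return out

# Sanity check against the untiled reference computation.
L, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
scores_full = Q @ K.T / np.sqrt(d)
probs_full = np.exp(scores_full - scores_full.max(-1, keepdims=True))
probs_full /= probs_full.sum(-1, keepdims=True)
assert np.allclose(tiled_attention(Q, K, V), probs_full @ V)
```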