We introduce Monte-Carlo Attention (MCA), a randomized approximation method for reducing the computational cost of the self-attention mechanism in Transformer architectures. MCA exploits the fact that the importance of each token in an input sequence varies with its attention score; thus, some degree of error can be tolerated when encoding tokens with low attention. Using approximate matrix multiplication, MCA encodes input tokens under different error bounds, so that tokens with low attention scores are computed with relaxed precision while the errors of salient elements are minimized. MCA can operate in parallel with other attention optimization schemes and requires no model modification. We study the theoretical error bounds and demonstrate that MCA reduces attention complexity (in FLOPs) of various Transformer models by up to 11$\times$ on GLUE benchmarks without compromising model accuracy.
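As a point of reference for the approximate-matrix-multiplication step, the classical Monte-Carlo estimator of Drineas and Kannan samples $c$ column--row pairs of the factor matrices; the sketch below states this standard estimator and its well-known Frobenius-norm error bound, not necessarily the exact formulation used by MCA:
\[
  AB \;\approx\; \frac{1}{c}\sum_{t=1}^{c} \frac{1}{p_{i_t}}\, A^{(i_t)} B_{(i_t)},
  \qquad
  \mathbb{E}\,\Bigl\| AB - \frac{1}{c}\sum_{t=1}^{c} \frac{1}{p_{i_t}}\, A^{(i_t)} B_{(i_t)} \Bigr\|_F^2
  \;\le\; \frac{1}{c}\sum_{i=1}^{n} \frac{\bigl\|A^{(i)}\bigr\|_2^2\,\bigl\|B_{(i)}\bigr\|_2^2}{p_i},
\]
where $A^{(i)}$ is the $i$-th column of $A$, $B_{(i)}$ is the $i$-th row of $B$, and the indices $i_t$ are drawn i.i.d. with probabilities $p_i$. The sample count $c$ directly controls the error bound; under MCA's premise, tokens with low attention scores could be handled with a smaller $c$ (coarser approximation) while salient tokens receive a larger one.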