Transformers' quadratic complexity with respect to the input sequence length has motivated a body of work on efficient sparse approximations to softmax attention. An alternative path, used by entmax transformers, consists of having built-in exact sparse attention; however, this approach still requires quadratic computation. In this paper, we propose Sparsefinder, a simple model trained to identify the sparsity pattern of entmax attention before computing it. We experiment with three variants of our method, based on distances, quantization, and clustering, on two tasks: machine translation (attention in the decoder) and masked language modeling (encoder-only). Our work provides a new angle to study model efficiency through an extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph. This allows for a detailed comparison between different models along their Pareto curves, which is important to guide future benchmarks for sparse attention models.
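As a rough illustration of the two quantities traded off above (not the paper's implementation), the sketch below computes the sparsity and recall of a predicted attention graph against the exact entmax graph; the function name `graph_metrics` and the toy boolean masks are hypothetical, introduced only to make the definitions concrete.

```python
import numpy as np

def graph_metrics(pred_mask: np.ndarray, gold_mask: np.ndarray):
    """Sparsity and recall of a predicted attention graph.

    pred_mask, gold_mask: boolean arrays of shape (n_queries, n_keys).
    gold_mask marks the nonzero entries of the exact entmax attention;
    pred_mask marks the query-key pairs that the predictor keeps.
    """
    # Sparsity: fraction of query-key pairs the predictor prunes away.
    sparsity = 1.0 - pred_mask.mean()
    # Recall: fraction of true (nonzero-entmax) edges that survive pruning.
    recall = (pred_mask & gold_mask).sum() / max(gold_mask.sum(), 1)
    return sparsity, recall

# Toy 4x4 attention graphs (hypothetical values for illustration only).
gold = np.array([[1, 1, 0, 0],
                 [0, 1, 1, 0],
                 [0, 0, 1, 1],
                 [1, 0, 0, 1]], dtype=bool)
pred = np.array([[1, 1, 1, 0],
                 [0, 1, 1, 0],
                 [0, 0, 1, 0],
                 [1, 1, 0, 1]], dtype=bool)

print(graph_metrics(pred, gold))  # ideally high sparsity with recall near 1
```

Sweeping a predictor's pruning threshold and plotting these two numbers against each other traces the kind of sparsity-recall Pareto curve used to compare models.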